
HumanEvalFix integration #1908

Merged — 26 commits, merged on May 23, 2024
Commits (26) — the file changes shown below are from 12 of these commits.
677aec1
Preliminary HumanEvalFix integration
Muennighoff May 20, 2024
931e0fd
Clean paths
Muennighoff May 20, 2024
dc99ac2
fix: set workspace path correctly for config
xingyaoww May 21, 2024
4c9c9db
add missing run_infer.sh
xingyaoww May 21, 2024
de41c3e
update run_infer w/o hard coded agent
xingyaoww May 21, 2024
ba32bfb
fix typo
xingyaoww May 21, 2024
7964b81
change `instance_id` to `task_id`
xingyaoww May 21, 2024
6af2898
add the warning and env var setting to run_infer.sh
xingyaoww May 21, 2024
4ae0d36
reset back workspace mount at the end of each instance
xingyaoww May 21, 2024
01e9b54
10 max iter is probably enough for humanevalfix
xingyaoww May 21, 2024
114d5b7
Remove unneeded section
Muennighoff May 21, 2024
02c522c
Merge branch 'main' into humanevalfix
Muennighoff May 21, 2024
cacb3df
Fix link
Muennighoff May 22, 2024
4d32536
Use logger
Muennighoff May 22, 2024
de5c37f
Update run_infer.py
tangxiangru May 22, 2024
116ff4d
Update README.md
tangxiangru May 22, 2024
a551b63
Update README.md
tangxiangru May 22, 2024
6127082
Update README.md
tangxiangru May 22, 2024
20d62a8
Update README.md
tangxiangru May 22, 2024
f890484
Update README.md
tangxiangru May 22, 2024
c715e90
Update pyproject.toml
tangxiangru May 22, 2024
3e18094
Delete poetry.lock
tangxiangru May 22, 2024
3aab77d
update poetry.lock
tangxiangru May 22, 2024
0679320
Update README.md
tangxiangru May 22, 2024
60709ca
Update README.md
tangxiangru May 22, 2024
66b1c4b
Merge branch 'main' into humanevalfix
xingyaoww May 23, 2024
1 change: 1 addition & 0 deletions evaluation/README.md
@@ -13,6 +13,7 @@ all the preprocessing/evaluation/analysis scripts.
## Supported Benchmarks

- SWE-Bench: [`evaluation/swe_bench`](./swe_bench)
- HumanEvalFix: [`evaluation/humanevalfix`](./humanevalfix)

### Result Visualization

46 changes: 46 additions & 0 deletions evaluation/humanevalfix/README.md
@@ -0,0 +1,46 @@
# HumanEvalFix Evaluation with OpenDevin

This directory implements the evaluation of agents on HumanEvalFix, part of the HumanEvalPack benchmark introduced in [OctoPack: Instruction Tuning Code Large Language Models](https://arxiv.org/abs/2308.07124). See https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py for the reference implementation used in the paper.
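
For context, here is a minimal sketch of how HumanEvalFix instances can be loaded from the HuggingFace Hub. This is an assumption based on the public `bigcode/humanevalpack` dataset used by the reference implementation; the field names follow that dataset and may differ from what `run_infer.py` in this PR actually consumes.

```python
# Illustrative sketch only: load the Python split of HumanEvalPack, which
# contains the buggy solutions and unit tests that HumanEvalFix is built on.
from datasets import load_dataset

dataset = load_dataset("bigcode/humanevalpack", "python")["test"]

instance = dataset[0]
print(instance["task_id"])         # e.g. "Python/0"
print(instance["declaration"])     # function signature the fix must preserve
print(instance["buggy_solution"])  # intentionally broken implementation to repair
print(instance["test"])            # unit tests that expose the bug
```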

## Setup Environment

Please follow [this document](https://github.com/OpenDevin/OpenDevin/blob/main/Development.md) to set up a local development environment for OpenDevin.

In addition, evaluation requires the `evaluate` package, which can be installed with:
```bash
pip install evaluate
```
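
As a rough illustration of why `evaluate` is needed, the sketch below scores candidate fixes with its `code_eval` (pass@k) metric. The exact scoring call used by this benchmark's scripts may differ, and the test and candidate strings here are made-up examples.

```python
import os

import evaluate

# code_eval executes model-generated code, so the metric refuses to run unless
# this environment variable is explicitly set.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = evaluate.load("code_eval")

# One unit-test string per problem (references) and, per problem, a list of
# candidate solutions (predictions). These strings are illustrative only.
tests = ["assert add(2, 3) == 5"]
candidates = [["def add(a, b):\n    return a + b"]]

pass_at_k, results = code_eval.compute(references=tests, predictions=candidates, k=[1])
print(pass_at_k)  # e.g. {'pass@1': 1.0}
```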

## Configure OpenDevin and your LLM

Create a `config.toml` file if it does not exist at the root of the workspace.

Add the following configurations:

```toml
[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"

# TODO: Change these to the model you want to evaluate
[eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0

[eval_some_openai_compatible_model]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
```

Review comment from li-boxuan (Member), May 20, 2024: You probably also want to include `enable_auto_lint = true`. Evaluation of CodeActAgent on SWE-bench-lite shows that this option could give the LLM a hint of indentation errors and thus boost the final score (if the language is Python).

Reply (Contributor): @li-boxuan fixed

## Run Inference on HumanEvalFix

```bash
./evaluation/humanevalfix/scripts/run_infer.sh eval_gpt4_1106_preview
```

You can replace `eval_gpt4_1106_preview` with the name of any model section you have set up in `config.toml`.
