adding a script to fetch and convert devin's output for evaluation by Jiaxin-Pei · Pull Request #81 · OpenHands/OpenHands

Jiaxin-Pei · 2024-03-21T14:13:45Z

No description provided.

xingyaoww · 2024-03-21T14:51:49Z

How about we put this file in SWE-Bench/scripts?

I'm not quite sure about this. It seems more reasonable to me to keep dataset-related files in the dataset folder. @JustinLin610 @libowen2121, any thoughts on this?

Ohh, I suggest we do this: `mv src/prepare_devin_outputs_for_evaluation.py scripts/prepare_devin_outputs_for_evaluation.py`

Oh, my bad, I thought we were moving it outside the evaluation folder. Will do.

xingyaoww · 2024-03-21T14:53:49Z

+  - `devin_eval_analysis.ipynb`: notebook analyzing devin's outputs
+- src
+  - `prepare_devin_outputs_for_evaluation.py`: script fetching and converting devin's output into the desired json file for evaluation.
+    - outputs: two json files under `evaluation/SWE-bench/data/` that can be directly used for evaluation
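The README entry describes the conversion only at a high level. Below is a minimal Python sketch of such a conversion step. It assumes Devin's outputs arrive as a directory of unified-diff text files named `<instance_id>.txt`; that layout is an assumption, not something this PR states. The sketch emits records in the SWE-bench prediction format (`instance_id`, `model_patch`, `model_name_or_path`). `diffs_to_predictions`, `write_predictions`, and the `"devin"` model name are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def diffs_to_predictions(diff_dir: str, model_name: str = "devin") -> List[Dict[str, str]]:
    """Convert a directory of <instance_id>.txt diff files into SWE-bench predictions.

    Hypothetical helper: assumes each file stem is a SWE-bench instance_id
    and each file body is the model's patch as a unified diff.
    """
    predictions = []
    # Sorted so the output order is stable across runs; non-.txt files are ignored.
    for path in sorted(Path(diff_dir).glob("*.txt")):
        predictions.append(
            {
                "instance_id": path.stem,
                "model_patch": path.read_text(encoding="utf-8"),
                "model_name_or_path": model_name,
            }
        )
    return predictions


def write_predictions(predictions: List[Dict[str, str]], out_path: str) -> None:
    """Dump the predictions as an indented JSON list, as in the reviewed snippet."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, indent=4)
```

Keeping the patch text untouched in `model_patch` means the evaluation harness applies exactly what Devin produced.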


Can you upload the post-processed files to our Hugging Face datasets and add a curl or wget command here so people can download them directly for debugging? You can request to join if you haven't already: https://huggingface.co/OpenDevin
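For the requested download command, here is a small Python sketch. It assumes the processed files live in a Hugging Face dataset repo under the OpenDevin organization. It builds the Hub's `resolve` URL, the same URL a `curl -L -O` or `wget` command would fetch. `hf_dataset_file_url` and `download_dataset_file` are hypothetical helpers, and the repo id in the usage note is a placeholder, not a published dataset name.

```python
from urllib.parse import quote
from urllib.request import urlretrieve


def hf_dataset_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Direct-download URL for one file in a Hugging Face *dataset* repo."""
    # The revision is fully escaped; slashes inside the file path are kept.
    return (
        f"https://huggingface.co/datasets/{repo_id}/resolve/"
        f"{quote(revision, safe='')}/{quote(filename)}"
    )


def download_dataset_file(repo_id: str, filename: str, dest: str, revision: str = "main") -> str:
    """Fetch one file to `dest`, with the same effect as `wget -O dest <url>`."""
    urlretrieve(hf_dataset_file_url(repo_id, filename, revision), dest)
    return dest
```

Usage, with a placeholder repo id: `download_dataset_file("OpenDevin/<dataset-name>", "fail_output.json", "fail_output.json")`.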

xingyaoww · 2024-03-21T14:55:01Z

+
+    with open(os.path.join(output_dir, "fail_output.json"), "w") as fail_file:
+        json.dump(failed_files_info, fail_file, indent=4)
+


I'm debating whether we want to make these two separate files or just one. How about we merge them into one and add an additional bool field like `devin_pass`?

It only takes ~1 minute to fetch and process the files. The purpose of having two files is that you can start directly from the passed files for pilot testing. I can generate another merged file and upload it to HF.

Having both options sounds good! Maybe we can add an argument to the script to switch that behavior, upload both versions to HF, and let users decide which one they want to download.

Sounds good!
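The switch agreed on in this thread could look roughly like the following Python sketch. `fail_output.json` comes from the snippet under review, and the `failed` parameter plays the role of `failed_files_info`. `write_outputs`, `parse_args`, the `--merge` flag, `pass_output.json`, and `merged_output.json` are hypothetical names chosen for illustration. In merged mode every record gets the `devin_pass` boolean suggested above.

```python
import argparse
import json
import os
from typing import Dict, List, Optional


def write_outputs(passed: List[Dict], failed: List[Dict], output_dir: str, merge: bool = False) -> List[str]:
    """Write pass/fail results as two files, or as one merged file with a devin_pass flag.

    Returns the paths written. File names other than fail_output.json are hypothetical.
    """
    os.makedirs(output_dir, exist_ok=True)
    if merge:
        # dict(item, devin_pass=...) copies each record, so the inputs are not mutated.
        merged = [dict(item, devin_pass=True) for item in passed]
        merged += [dict(item, devin_pass=False) for item in failed]
        path = os.path.join(output_dir, "merged_output.json")
        with open(path, "w") as f:
            json.dump(merged, f, indent=4)
        return [path]
    paths = []
    for name, items in (("pass_output.json", passed), ("fail_output.json", failed)):
        path = os.path.join(output_dir, name)
        with open(path, "w") as f:
            json.dump(items, f, indent=4)
        paths.append(path)
    return paths


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Convert Devin's outputs for SWE-bench evaluation.")
    parser.add_argument("--output-dir", default="evaluation/SWE-bench/data")
    parser.add_argument("--merge", action="store_true",
                        help="write one merged file with a devin_pass field instead of two files")
    return parser.parse_args(argv)
```

Because the inputs stay untouched, the same in-memory results can be written in both modes in one run. That fits the plan to upload both versions to HF.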

xingyaoww

LGTM!

* a starting point for SWE-Bench evaluation with docker
* fix the swe-bench uid issue
* typo fixed
* fix conda missing issue
* move files based on new PR
* Update doc and gitignore using devin prediction file from #81
* fix typo
* add a sentence
* fix typo in path
* fix path

---------

Co-authored-by: Binyuan Hui <binyuan.hby@alibaba-inc.com>

…penHands#81)

* adding code to fetch and convert devin's output for evaluation
* update README.md
* update code for fetching and processing devin's outputs
* update code for fetching and processing devin's outputs


Co-authored-by: openhands <openhands@all-hands.dev>

Swaps DefaultUserAuth with CognitoUserAuth. Swaps FileSettingsStore with CognitoS3SettingsStore. Swaps FileSecretsStore with CognitoS3SecretsStore. All for multi-tenant user isolation. Custom modules (cognito_user_auth.py, s3_settings_store.py, s3_secrets_store.py) are dropped into /app/openhands/app_server/{user_auth,settings,secrets}/ at Docker build time by openhands-infra/docker/Dockerfile (PR OpenHands#81). V1 port of v1.6.0-fargate commit 00130ab. server_config.py is V0-tagged upstream but the settings_store_class / secret_store_class / user_auth_class fields are still active in v1.7.0 — they drive get_impl() in shared.py to resolve the configured V1 ABC subclasses. Refs: zxkane/openhands-infra#81
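The commit message above relies on config fields (`settings_store_class`, `secret_store_class`, `user_auth_class`) that name implementation classes for `get_impl()` to resolve. Below is a generic Python sketch of that kind of dotted-path resolution, with a subclass check against the abstract base. `resolve_impl` is a hypothetical name; this is not the actual code in shared.py.

```python
import importlib
from typing import Type, TypeVar

T = TypeVar("T")


def resolve_impl(base: Type[T], dotted_path: str) -> Type[T]:
    """Import 'package.module.ClassName' and verify it subclasses `base`.

    Hypothetical stand-in for a config-driven resolver such as get_impl().
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    cls = getattr(importlib.import_module(module_name), class_name)
    # Reject anything that is not a class implementing the expected base.
    if not (isinstance(cls, type) and issubclass(cls, base)):
        raise TypeError(f"{dotted_path} is not a subclass of {base.__name__}")
    return cls
```

Take an assumed config value such as `openhands.app_server.settings.s3_settings_store.CognitoS3SettingsStore`. A resolver like this would import the module dropped in at Docker build time. It fails fast at startup if the class does not implement the expected base.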

The Fargate sandbox orchestrator stamps each DynamoDB sandbox record with USER_ID from the start request environment so OpenResty can later enforce cross-user runtime authorization (the runtime subdomain proxy checks that the requesting user matches the sandbox owner). Upstream RemoteSandboxService.start_sandbox knows the user_id (it stores it as created_by_user_id) but never forwards it into the /start environment. Result: DDB user_id="anonymous", OpenResty ownership check is skipped, and any authenticated user can hit any runtime URL. Inject environment["USER_ID"] = user_id right after _init_environment returns. Fixes the cross-user runtime denial regression observed in PR OpenHands#81 staging E2E (TC-011).
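The following small Python sketch makes the failure mode concrete. The real ownership check runs in OpenResty, and the real injection happens in `RemoteSandboxService.start_sandbox`. `stamp_user_id` and `owner_check_allows` are hypothetical functions that only model the behavior the commit message describes, including the skipped check when the recorded owner is `"anonymous"`.

```python
from typing import Dict, Optional


def stamp_user_id(environment: Dict[str, str], user_id: str) -> Dict[str, str]:
    """Return a copy of the /start environment with USER_ID set to the sandbox owner."""
    env = dict(environment)
    env["USER_ID"] = user_id
    return env


def owner_check_allows(requesting_user: str, sandbox_owner: Optional[str]) -> bool:
    """Model of the proxy rule as described: a missing or "anonymous" owner skips the check."""
    if sandbox_owner in (None, "anonymous"):
        return True  # the gap: no recorded owner means no enforcement
    return requesting_user == sandbox_owner
```

Without the stamp, the record falls back to `"anonymous"` and any user passes. With it, only the owner does. The sketch returns a copy for testability, whereas the commit assigns into `environment` directly.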

Jiaxin-Pei added 2 commits March 21, 2024 10:10

adding code to fetch and convert devin's output for evaluation

3c1f36b

update README.md

7e95a01

Jiaxin-Pei mentioned this pull request Mar 21, 2024

[Evaluation] Convert Devin's output into SWE-Bench runnable format #80

Closed

xingyaoww reviewed Mar 21, 2024


Jiaxin-Pei and others added 3 commits March 21, 2024 13:10

Merge branch 'OpenDevin:main' into main

b509c69

update code for fetching and processing devin's outputs

b55541a

update code for fetching and processing devin's outputs

b4b6786

xingyaoww approved these changes Mar 21, 2024


xingyaoww merged commit dc88dac into OpenHands:main Mar 21, 2024

xingyaoww added a commit to xingyaoww/OpenHands that referenced this pull request Mar 22, 2024

Update doc and gitignore using devin prediction file from OpenHands#81

b7bd1d3

malhotra5 pushed a commit that referenced this pull request Mar 25, 2026

Add comprehensive tests for events_to_messages conversions (#81)

969863e

Co-authored-by: openhands <openhands@all-hands.dev>


xingyaoww merged 5 commits into OpenHands:main from Jiaxin-Pei:main
