adding a script to fetch and convert devin's output for evaluation #81
How about we put this file to SWE-Bench/scripts?
I'm not quite sure about this. It seems more reasonable to keep dataset related files in the dataset folder to me. @JustinLin610 @libowen2121 any thoughts on this?
Oh, I suggest we do this: mv src/prepare_devin_outputs_for_evaluation.py scripts/prepare_devin_outputs_for_evaluation.py
Oh, my bad, I thought we were moving it outside the evaluation folder. Will do.
- `devin_eval_analysis.ipynb`: notebook analyzing devin's outputs
- src
  - `prepare_devin_outputs_for_evaluation.py`: script fetching and converting devin's output into the desired json file for evaluation.
- outputs: two json files under `evaluation/SWE-bench/data/` that can be directly used for evaluation
Can you upload the post-processed files to our Hugging Face datasets and add a curl or wget command here so people can download them directly for debugging? You can request to join if you haven't already: https://huggingface.co/OpenDevin
with open(os.path.join(output_dir, "fail_output.json"), "w") as fail_file:
    json.dump(failed_files_info, fail_file, indent=4)
I'm debating whether we want to make this two separate files, or just one file -- how about we merge them into one, and add an additional bool field like devin_pass?
It only takes ~1 minute to fetch and process the files. The purpose of having two files is you can directly start from the passed files for pilot testing. I can generate another merged file and upload it to HF
Having both options sounds good! Maybe we can add an argument to the script to switch that behavior, upload both versions to HF, and let users decide which one they want to download.
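The single-file option discussed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: `merge_outputs`, `write_outputs`, the `--merged` flag, and the file names `merged_output.json` and `pass_output.json` are hypothetical. Only `fail_output.json`, `output_dir`, and the `devin_pass` field appear in the thread, and each result set is assumed to be a list of dicts.

```python
import argparse
import json
import os


def merge_outputs(passed, failed):
    """Combine both result lists, tagging each entry with a devin_pass flag."""
    merged = [{**entry, "devin_pass": True} for entry in passed]
    merged += [{**entry, "devin_pass": False} for entry in failed]
    return merged


def write_outputs(output_dir, passed, failed, merged=False):
    """Write one merged file or two separate files; return the paths written."""
    os.makedirs(output_dir, exist_ok=True)
    if merged:
        # Hypothetical file name for the single-file variant.
        files = {"merged_output.json": merge_outputs(passed, failed)}
    else:
        # "pass_output.json" is an assumed name; only fail_output.json is in the PR.
        files = {"pass_output.json": passed, "fail_output.json": failed}
    paths = []
    for name, data in files.items():
        path = os.path.join(output_dir, name)
        with open(path, "w") as f:
            json.dump(data, f, indent=4)
        paths.append(path)
    return paths


def parse_args(argv=None):
    """Parse the (hypothetical) command-line switch between the two layouts."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--output-dir", default="output")
    parser.add_argument(
        "--merged",
        action="store_true",
        help="write one file with a devin_pass field instead of two files",
    )
    return parser.parse_args(argv)
```

Keeping the two-file layout as the default would preserve the pilot-testing workflow mentioned above, while `--merged` yields the single file with the extra boolean field.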
* a starting point for SWE-Bench evaluation with docker
* fix the swe-bench uid issue
* typo fixed
* fix conda missing issue
* move files based on new PR
* Update doc and gitignore using devin prediction file from #81
* fix typo
* add a sentence
* fix typo in path
* fix path

---------

Co-authored-by: Binyuan Hui <binyuan.hby@alibaba-inc.com>
…penHands#81)
* adding code to fetch and convert devin's output for evaluation
* update README.md
* update code for fetching and processing devin's outputs
* update code for fetching and processing devin's outputs
* a starting point for SWE-Bench evaluation with docker
* fix the swe-bench uid issue
* typo fixed
* fix conda missing issue
* move files based on new PR
* Update doc and gitignore using devin prediction file from OpenHands#81
* fix typo
* add a sentence
* fix typo in path
* fix path

---------

Co-authored-by: Binyuan Hui <binyuan.hby@alibaba-inc.com>
Co-authored-by: openhands <openhands@all-hands.dev>
Swaps DefaultUserAuth with CognitoUserAuth.
Swaps FileSettingsStore with CognitoS3SettingsStore.
Swaps FileSecretsStore with CognitoS3SecretsStore.
All for multi-tenant user isolation.
Custom modules (cognito_user_auth.py, s3_settings_store.py, s3_secrets_store.py)
are dropped into /app/openhands/app_server/{user_auth,settings,secrets}/ at
Docker build time by openhands-infra/docker/Dockerfile (PR OpenHands#81).
V1 port of v1.6.0-fargate commit 00130ab. server_config.py is V0-tagged
upstream but the settings_store_class / secret_store_class / user_auth_class
fields are still active in v1.7.0 — they drive get_impl() in shared.py to
resolve the configured V1 ABC subclasses.
Refs: zxkane/openhands-infra#81
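The class swap described in this commit message relies on resolving a configured class name to a subclass of an abstract base. The following is a generic sketch of that pattern, not the upstream `get_impl()` code: `resolve_impl`, the stand-in `SettingsStore` ABC, and `FileSettingsStore` as defined here are hypothetical illustrations, and the dotted-path convention is an assumption.

```python
import importlib
from abc import ABC, abstractmethod


class SettingsStore(ABC):
    """Stand-in abstract base; hypothetical, not the upstream class."""

    @abstractmethod
    def load(self):
        ...


class FileSettingsStore(SettingsStore):
    """Stand-in default implementation used when nothing is configured."""

    def load(self):
        return {"source": "file"}


def resolve_impl(base, dotted_path=None):
    """Return `base` when no path is configured, else import and validate the class."""
    if dotted_path is None:
        return base
    module_name, _, class_name = dotted_path.rpartition(".")
    impl = getattr(importlib.import_module(module_name), class_name)
    # Reject anything that is not a class derived from the expected base.
    if not (isinstance(impl, type) and issubclass(impl, base)):
        raise TypeError(f"{dotted_path} is not a subclass of {base.__name__}")
    return impl
```

With a mechanism like this, dropping a custom module into the image and pointing a config field at its dotted path is enough to replace the default store, which is the behavior the commit message depends on.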
The Fargate sandbox orchestrator stamps each DynamoDB sandbox record with USER_ID from the start request environment so OpenResty can later enforce cross-user runtime authorization (the runtime subdomain proxy checks that the requesting user matches the sandbox owner).

Upstream RemoteSandboxService.start_sandbox knows the user_id (it stores it as created_by_user_id) but never forwards it into the /start environment. Result: DDB user_id="anonymous", OpenResty ownership check is skipped, and any authenticated user can hit any runtime URL.

Inject environment["USER_ID"] = user_id right after _init_environment returns. Fixes the cross-user runtime denial regression observed in PR OpenHands#81 staging E2E (TC-011).
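The effect of the fix can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the upstream or infra code: `init_environment`, `build_start_environment`, `record_owner`, and `is_authorized` are hypothetical helpers standing in for `_init_environment`, the `/start` request, the DynamoDB record, and the OpenResty ownership check, respectively.

```python
def init_environment(base_env):
    """Stand-in for _init_environment: returns a fresh copy of the environment."""
    return dict(base_env)


def build_start_environment(base_env, user_id):
    """Build the /start environment and stamp it with the sandbox owner."""
    environment = init_environment(base_env)
    if user_id:
        # The fix: forward the known user_id right after the environment is built.
        environment["USER_ID"] = user_id
    return environment


def record_owner(environment):
    """Model of the stored owner: falls back to "anonymous" without USER_ID."""
    return environment.get("USER_ID", "anonymous")


def is_authorized(requesting_user, owner):
    """Model of the ownership check; an "anonymous" owner skips enforcement."""
    if owner == "anonymous":
        return True
    return requesting_user == owner
```

In this model, a sandbox started without `USER_ID` is recorded as `"anonymous"` and admits any user, while one started with the injected value only admits its owner.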