
feat: add SWE-bench fullset support #3477

Merged
merged 7 commits into from
Sep 3, 2024

Conversation

xingyaoww
Contributor

@xingyaoww xingyaoww commented Aug 19, 2024

What is the problem that this fixes or functionality that this introduces? Does it fix any open issues?


Give a summary of what the PR does, explaining any non-trivial design decisions

  • Support setting the SWE-Bench dataset and split for evaluation via a command-line argument.
  • Update the documentation so people can run run_infer.sh directly without first pulling instance Docker images; run_infer.sh now pulls them automatically.
  • Add a script, with documentation, for pushing new SWE-Bench images to a remote registry.
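The dataset/split selection described in the first bullet can be sketched roughly as follows. This is a minimal, hypothetical sketch: the `--set` flag and the `resolve_dataset` helper are illustrative, though the `'full-test'` / `'lite-test'` values and dataset ids mirror the snippet quoted later in this conversation.

```python
import argparse

# Hypothetical mapping from a --set value to a Hugging Face dataset id
# and split; values mirror the branches quoted later in this thread.
DATASETS = {
    'full-test': ('princeton-nlp/SWE-bench', 'test'),
    'lite-test': ('princeton-nlp/SWE-bench_Lite', 'test'),
}

def resolve_dataset(set_name):
    """Return (dataset_id, split) for a chosen evaluation set."""
    try:
        return DATASETS[set_name]
    except KeyError:
        raise ValueError(f'unknown set: {set_name!r}')

parser = argparse.ArgumentParser()
parser.add_argument('--set', choices=sorted(DATASETS), default='lite-test')
```

The resolved pair would then be passed to `datasets.load_dataset(dataset_id, split=split)` to fetch the instances to evaluate.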

Other references

Collaborator

@yufansong yufansong Aug 19, 2024


QQ: does the evaluation need so many images? 🙀

Contributor Author


One image per instance, so yes 😢 - that's why we need good infra to run this at scale

Collaborator


Fine, that is crazy.


Thanks @xingyaoww . This is exactly what I was looking for...

    from datasets import load_dataset  # Hugging Face datasets library

    if args.set == 'full-test':
        dataset = load_dataset('princeton-nlp/SWE-bench', split='test')
    elif args.set == 'lite-test':
        dataset = load_dataset('princeton-nlp/SWE-bench_Lite', split='test')

It would be awesome if you could add 'princeton-nlp/SWE-bench', split='dev' and 'princeton-nlp/SWE-bench_Lite', split='dev' as well :)
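The requested dev splits could be handled by extending the same dispatch into a lookup table. This is a hypothetical sketch: the `'full-dev'` / `'lite-dev'` set names are illustrative, while the dataset ids and existing test entries come from the snippet above.

```python
# Hypothetical extension covering dev splits alongside the existing
# test splits; 'full-dev' / 'lite-dev' are illustrative set names.
SETS = {
    'full-test': ('princeton-nlp/SWE-bench', 'test'),
    'lite-test': ('princeton-nlp/SWE-bench_Lite', 'test'),
    'full-dev': ('princeton-nlp/SWE-bench', 'dev'),
    'lite-dev': ('princeton-nlp/SWE-bench_Lite', 'dev'),
}

def pick_dataset(set_name):
    """Return (dataset_id, split); raises KeyError for unknown sets."""
    return SETS[set_name]
```

A table like this keeps the if/elif chain from growing with every new dataset/split combination.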

Contributor Author

@xingyaoww xingyaoww Sep 2, 2024


Hey @jatinganhotra, I tried to look into this by adding dev set -- but was blocked by princeton-nlp/SWE-bench#199. Will try to add support for dev again once that issue is resolved.


Being addressed by #3478

@xingyaoww xingyaoww marked this pull request as ready for review September 2, 2024 18:18
@neubig
Contributor

neubig commented Sep 3, 2024

Looks good, thanks!

@neubig neubig merged commit d283420 into main Sep 3, 2024
17 checks passed
@neubig neubig deleted the xw/add-swebench-fullset branch September 3, 2024 00:28