
Integrate benchmark and sampler #490

Merged

granawkins merged 12 commits into main on Jan 18, 2024

Conversation

granawkins (Member) commented:

This makes the benchmarks use the Sampler as much as possible. There's now a Benchmark class that holds a list of samples and evaluates each one with run_sample.
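A minimal sketch of that structure, assuming a simplified Sample and a stubbed run_sample; this is illustrative, not the PR's actual code:

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    # Illustrative fields only.
    title: str
    repo: str


async def run_sample(sample: Sample) -> dict:
    # Stub standing in for the real sampler-based evaluation.
    return {"title": sample.title, "passed": True}


@dataclass
class Benchmark:
    title: str
    samples: list[Sample] = field(default_factory=list)


async def run_benchmark(benchmark: Benchmark) -> list[dict]:
    # Evaluate every sample in the benchmark with run_sample.
    return [await run_sample(s) for s in benchmark.samples]
```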

Pull Request Checklist

  • Documentation has been updated, or this change doesn't require that

jakethekoenig and others added 10 commits January 16, 2024 14:31
While writing this I made the following changes:
* benchmark_repos was made a subdirectory of benchmarks so verify functions
  can import from generated code. Otherwise Python complains it isn't
  part of a module.
* setup_repo does in fact change the current directory; evaluate_py and
  evaluate_sample now change the directory back.
* The functions are now wrapped in a try so spurious errors won't crash a
  run (see the sketch below).
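A minimal sketch of the pattern described in the last two bullets; setup_repo and run_sample here are simplified stand-ins rather than the actual implementations:

```python
import os


def setup_repo(sample):
    # Simplified stand-in: the real setup changes the current directory
    # into the benchmark repo.
    os.chdir(sample["repo_dir"])


async def run_sample(sample):
    # Simplified stand-in for the real sampler-based evaluation.
    return {"name": sample["name"], "passed": True}


async def evaluate_sample(sample):
    # Change back to the original directory afterwards, and catch
    # spurious errors so one bad sample doesn't crash the whole run.
    original_cwd = os.getcwd()
    try:
        setup_repo(sample)
        return await run_sample(sample)
    except Exception as e:
        print(f"Error evaluating sample {sample['name']}: {e}")
        return None
    finally:
        os.chdir(original_cwd)
```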
jakethekoenig (Member) left a comment:
Thanks so much! I really like the consolidation here between Sample and Modules. It makes things a lot cleaner. 3 small notes.

@@ -1,3 +1,5 @@
+from __future__ import annotations
+
 #!/usr/bin/env python

This line needs to be on top to allow this to be run as a script, e.g. so that ./benchmarks/benchmark_runner.py works instead of requiring python ./benchmarks/benchmark_runner.py.
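For reference, a sketch of the ordering being asked for; the docstring is just a placeholder:

```python
#!/usr/bin/env python
"""Placeholder docstring: the shebang must stay on the very first line."""
from __future__ import annotations
```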


print("Benchmark:", title)
async def run_benchmark(

In my opinion this should be an instance method of Benchmark named run.
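A rough sketch of what that could look like, with a simplified Benchmark and the evaluation function passed in; everything here besides the run name is illustrative:

```python
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


@dataclass
class Benchmark:
    title: str
    samples: list[Any] = field(default_factory=list)

    async def run(
        self, evaluate: Callable[[Any], Awaitable[Any]]
    ) -> list[Any]:
        # Instance method replacing the module-level run_benchmark();
        # callers would do `await benchmark.run(evaluate_sample)`.
        print("Benchmark:", self.title)
        return [await evaluate(sample) for sample in self.samples]
```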

@@ -116,11 +123,8 @@ async def compare_diffs(actual, generated):
     return await grade(prompt, comparison_prompt)


-async def grade_and_clean_diff(repo, response, result, comparison_diff=None):
+async def grade_and_clean_diff(diff, response, result, comparison_diff=None):

This doesn't clean anymore, so I think we should rename it to grade_diff. I like the change of cleaning in setup instead of teardown; it's good to have the benchmark repo easily inspectable after a benchmark.
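Roughly the rename being suggested; the body here is a hypothetical placeholder, only the name and signature matter:

```python
async def grade_diff(diff, response, result, comparison_diff=None):
    # Hypothetical placeholder body: the real function builds a grading
    # prompt from the diff and asks the grader; no cleanup happens here,
    # hence the simpler name. Signature matches the existing function.
    if comparison_diff is not None:
        result["comparison_grade"] = {"compared": True}
    result["diff_grade"] = {"graded": bool(diff)}
    return result
```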

granawkins changed the base branch from benchmark-dir-fix to main on January 18, 2024 at 22:41
granawkins merged commit fe66c09 into main on Jan 18, 2024
16 checks passed