Integrate benchmark and sampler #490
Conversation
While writing this I made the following changes:
- `benchmark_repos` was made a subdirectory of `benchmarks` so verify functions can import from generated code; otherwise Python complains it isn't part of a module.
- `setup_repo` does in fact change the current directory, so `evalute_py` and `evaluate_sample` now change the directory back.
- Both functions are now wrapped in a `try` so spurious errors won't crash a run.
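The directory-restore and error-containment pattern described above can be sketched as follows. This is a hedged illustration, not the repo's actual code: `evaluate_sample`'s body, the temp-dir stand-in for the cloned benchmark repo, and the result dict shape are all assumptions.

```python
import os
import tempfile

def evaluate_sample(sample):
    # Remember where we started, since setup_repo changes the cwd.
    start_dir = os.getcwd()
    try:
        # In the real runner, setup_repo would chdir into the benchmark
        # repo; a temp dir stands in for it here.
        os.chdir(tempfile.gettempdir())
        result = {"sample": sample, "error": None}
    except Exception as exc:
        # A spurious error is recorded instead of crashing the whole run.
        result = {"sample": sample, "error": str(exc)}
    finally:
        # Change the directory back, whatever happened above.
        os.chdir(start_dir)
    return result
```

The `finally` clause guarantees the cwd is restored even when a sample raises, so later samples see a clean starting directory.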
Thanks so much! I really like the consolidation here between Sample and Modules. It makes things a lot cleaner. 3 small notes.
benchmarks/benchmark_runner.py (Outdated)

```diff
@@ -1,3 +1,5 @@
+from __future__ import annotations
+
 #!/usr/bin/env python
```
This line needs to be at the top to allow this to be run as a script, e.g. so that `./benchmarks/benchmark_runner.py` works instead of requiring `python ./benchmarks/benchmark_runner.py`.
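A minimal demonstration of the point: the kernel only honors a `#!` line when it is the very first bytes of the file. The scratch file below is illustrative (it is not part of the repo, and it uses `python3` rather than the repo's `python` shebang so it runs on systems without a `python` alias).

```shell
# Write a scratch script whose shebang is on line 1, then execute it
# directly -- no explicit "python3" prefix needed.
cat > /tmp/shebang_demo.py <<'EOF'
#!/usr/bin/env python3
print("ok")
EOF
chmod +x /tmp/shebang_demo.py
/tmp/shebang_demo.py
```

If anything (even a blank line or an import) precedes the shebang, the kernel treats the file as plain text and direct execution falls back to the shell, which fails on Python syntax.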
benchmarks/benchmark_runner.py (Outdated)

```python
print("Benchmark:", title)
async def run_benchmark(
```
In my opinion this should be an instance method of `Benchmark` named `run`.
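The suggestion above would look roughly like this. Everything here is a hedged sketch: the `Benchmark` constructor, the `run_sample` placeholder, and the result shape are assumptions, not the PR's actual code.

```python
import asyncio

class Benchmark:
    def __init__(self, title, samples):
        self.title = title
        self.samples = samples

    async def run(self):
        # Formerly the free function run_benchmark(benchmark); as an
        # instance method, the title and samples come from self.
        print("Benchmark:", self.title)
        results = []
        for sample in self.samples:
            results.append(await self.run_sample(sample))
        return results

    async def run_sample(self, sample):
        # Placeholder; the real implementation generates and grades a diff.
        return {"sample": sample, "passed": True}

results = asyncio.run(Benchmark("demo", ["s1"]).run())
```

Callers then write `await benchmark.run()` instead of `await run_benchmark(benchmark)`, which keeps the benchmark's state and behavior together on the class.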
benchmarks/benchmark_runner.py (Outdated)

```diff
@@ -116,11 +123,8 @@ async def compare_diffs(actual, generated):
     return await grade(prompt, comparison_prompt)

-async def grade_and_clean_diff(repo, response, result, comparison_diff=None):
+async def grade_and_clean_diff(diff, response, result, comparison_diff=None):
```
This doesn't clean anymore. I think we should rename it to `grade_diff`. I like the change to having setup clean instead of teardown; it's good to have the benchmark repo easily inspectable after a benchmark.
This makes use of the Sampler as much as possible for benchmarks. Basically there's now a `class Benchmark`, which has a list of samples, and it evaluates each with `run_sample`.

Pull Request Checklist