Integrate benchmark and sampler #490
Conversation
While writing this I made the following changes:
- `benchmark_repos` was made a subdirectory of `benchmarks` so verify functions can import from generated code; otherwise Python complains it isn't part of a module.
- `setup_repo` does in fact change the current directory, so `evalute_py` and `evaluate_sample` now change the directory back.
- Both functions are now wrapped in a `try` so spurious errors won't crash a run.
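The directory-restore and error-containment pattern described above can be sketched as follows. This is a hedged illustration, not the repo's actual code: `evaluate_sample`'s body, the temp-dir stand-in for the cloned benchmark repo, and the result dict shape are all assumptions.

```python
import os
import tempfile

def evaluate_sample(sample):
    # Remember where we started, since setup_repo changes the cwd.
    start_dir = os.getcwd()
    try:
        # In the real runner, setup_repo would chdir into the benchmark
        # repo; a temp dir stands in for it here.
        os.chdir(tempfile.gettempdir())
        result = {"sample": sample, "error": None}
    except Exception as exc:
        # A spurious error is recorded instead of crashing the whole run.
        result = {"sample": sample, "error": str(exc)}
    finally:
        # Change the directory back, whatever happened above.
        os.chdir(start_dir)
    return result
```

The `finally` clause guarantees the cwd is restored even when a sample raises, so later samples see a clean starting directory.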
Thanks so much! I really like the consolidation here between Sample and Modules. It makes things a lot cleaner. 3 small notes.
benchmarks/benchmark_runner.py (Outdated)

```diff
@@ -1,3 +1,5 @@
+from __future__ import annotations
+
 #!/usr/bin/env python
```
This line needs to be at the top to allow this to be run as a script, e.g. so that `./benchmarks/benchmark_runner.py` works instead of requiring `python ./benchmarks/benchmark_runner.py`.
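A minimal demonstration of the point: the kernel only honors a `#!` line when it is the very first bytes of the file. The scratch file below is illustrative (it is not part of the repo, and it uses `python3` rather than the repo's `python` shebang so it runs on systems without a `python` alias).

```shell
# Write a scratch script whose shebang is on line 1, then execute it
# directly -- no explicit "python3" prefix needed.
cat > /tmp/shebang_demo.py <<'EOF'
#!/usr/bin/env python3
print("ok")
EOF
chmod +x /tmp/shebang_demo.py
/tmp/shebang_demo.py
```

If anything (even a blank line or an import) precedes the shebang, the kernel treats the file as plain text and direct execution falls back to the shell, which fails on Python syntax.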
benchmarks/benchmark_runner.py (Outdated)

```python
print("Benchmark:", title)
async def run_benchmark(
```
In my opinion this should be an instance method of `Benchmark` named `run`.
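The suggestion above would look roughly like this. Everything here is a hedged sketch: the `Benchmark` constructor, the `run_sample` placeholder, and the result shape are assumptions, not the PR's actual code.

```python
import asyncio

class Benchmark:
    def __init__(self, title, samples):
        self.title = title
        self.samples = samples

    async def run(self):
        # Formerly the free function run_benchmark(benchmark); as an
        # instance method, the title and samples come from self.
        print("Benchmark:", self.title)
        results = []
        for sample in self.samples:
            results.append(await self.run_sample(sample))
        return results

    async def run_sample(self, sample):
        # Placeholder; the real implementation generates and grades a diff.
        return {"sample": sample, "passed": True}

results = asyncio.run(Benchmark("demo", ["s1"]).run())
```

Callers then write `await benchmark.run()` instead of `await run_benchmark(benchmark)`, which keeps the benchmark's state and behavior together on the class.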
benchmarks/benchmark_runner.py (Outdated)

```diff
@@ -116,11 +123,8 @@ async def compare_diffs(actual, generated):
     return await grade(prompt, comparison_prompt)

-async def grade_and_clean_diff(repo, response, result, comparison_diff=None):
+async def grade_and_clean_diff(diff, response, result, comparison_diff=None):
```
This doesn't clean anymore. I think we should rename it to `grade_diff`. I like the change to having setup clean instead of teardown; it's good to have the benchmark repo easily inspectable after a benchmark.
This makes use of the Sampler as much as possible for benchmarks. Basically there's now a `class Benchmark`, which has a list of samples, and it evaluates each with `run_sample`.

Pull Request Checklist