## Estimating average sample efficiency is much faster than computing pass@k

Computing pass@10 on HumanEval+ takes about 10 minutes. If we want to try steering the base model with thousands of features, paying that cost for every feature is too expensive.

Estimating the sample efficiency of a task means testing how many rollouts it takes to get our *first* pass, so we can stop sampling a task as soon as one rollout passes.

Many tasks are also uninteresting for this purpose:
- If a task is already too easy for the base model and it passes on the first try, steering can't improve the base model's performance on that task.
- If a task is too hard, it would take a lot of samples (with an unclear upper bound) to see whether the steered base model has improved.

This notebook finds the tasks in the middle.
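
The first-pass count described above can be sketched as a loop that stops at the first passing rollout. This is only a minimal sketch: `rollouts_until_first_pass`, `classify_task`, and the `run_rollout` callable are hypothetical names (not part of EvalPlus), and the "too easy" threshold of 2 is an assumption.

```python
from typing import Callable, Optional


def rollouts_until_first_pass(
    run_rollout: Callable[[], bool], max_samples: int = 10
) -> Optional[int]:
    """Return the 1-indexed rollout on which the task first passes, or None."""
    for n in range(1, max_samples + 1):
        if run_rollout():  # hypothetical: generate one sample and run its tests
            return n  # stop early; later rollouts are never generated
    return None  # no pass within the budget: treat the task as too hard


def classify_task(n_samples: Optional[int], too_easy_threshold: int = 2) -> str:
    """Bucket a task by how many rollouts its first pass took."""
    if n_samples is None:
        return "too hard"
    if n_samples <= too_easy_threshold:
        return "too easy"
    return "in the middle"


# Simulated task that fails twice, then passes on the third rollout.
outcomes = iter([False, False, True, True])
n = rollouts_until_first_pass(lambda: next(outcomes))
print(n, classify_task(n))  # prints: 3 in the middle
```

Because the loop returns at the first pass, the fourth simulated outcome is never consumed, which is where the savings over a fixed 10 rollouts per task come from.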

In [25]:
import json

RESULTS_PATH = "evalplus_results/humaneval/Qwen--Qwen2.5-Coder-1.5B-Instruct_vllm_temp_1.0.eval_results.json"
# RESULTS_PATH = "evalplus_results/mbpp/Qwen--Qwen2.5-Coder-1.5B-Instruct_vllm_temp_1.0.eval_results.json"
with open(RESULTS_PATH) as f:  # close the file once it has been parsed
    results = json.load(f)

In [49]:
n_samples_by_task_id = {}

TOO_EASY_THRESHOLD = 2  # tasks solved within this many samples count as "too easy"

for task_id, samples in results['eval'].items():
    # look for the first sample that passes
    for n, attempt in enumerate(samples, start=1):  # n is 1-indexed
        if attempt['base_status'] == 'pass':
            n_samples_by_task_id[task_id] = n
            break
    else:  # no sample passed: too hard
        n_samples_by_task_id[task_id] = None

In [50]:
print(n_samples_by_task_id.values())

dict_values([6, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, None, 10, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 2, None, 3, 3, 1, 2, 1, 2, 1, 5, 1, 1, 2, 3, 5, 5, 1, 10, 1, 1, 1, 1, 1, 1, 1, 3, 3, None, 5, 1, 1, 1, 4, 1, None, 3, 2, 1, 1, None, 1, 2, None, 1, 1, 7, None, 1, 2, 1, 1, 2, 5, None, 2, None, None, 1, None, 1, None, 2, 1, 1, None, 2, 4, 2, 6, 1, 1, None, 1, 1, 1, 2, 1, 5, 1, 1, 2, 1, 1, 1, 1, 1, 8, 2, 1, None])


### Tasks in the sweet spot of difficulty - maybe we test on these?
These are tasks where the base model succeeded, but only after more than two tries.

When we steer the base model with features, does the average of 5.05 samples per task decline?

In [51]:
import numpy as np
# keep tasks that were solved, but needed more than TOO_EASY_THRESHOLD samples
challenging_tasks = {task_id: n_samples for task_id, n_samples in n_samples_by_task_id.items() if n_samples is not None and n_samples > TOO_EASY_THRESHOLD}
counts = list(challenging_tasks.values())
print(f"Average samples to solve these {len(challenging_tasks)} tasks: {np.mean(counts):.2f} ({sum(counts)} total runs)")
challenging_tasks

Average samples to solve these 22 tasks: 5.05 (111 total runs)


{'HumanEval/5': 6,
 'HumanEval/26': 5,
 'HumanEval/33': 10,
 'HumanEval/69': 3,
 'HumanEval/76': 3,
 'HumanEval/77': 3,
 'HumanEval/83': 5,
 'HumanEval/87': 3,
 'HumanEval/88': 5,
 'HumanEval/89': 5,
 'HumanEval/91': 10,
 'HumanEval/99': 3,
 'HumanEval/100': 3,
 'HumanEval/101': 5,
 'HumanEval/106': 4,
 'HumanEval/109': 3,
 'HumanEval/119': 7,
 'HumanEval/126': 5,
 'HumanEval/140': 4,
 'HumanEval/142': 6,
 'HumanEval/151': 5,
 'HumanEval/160': 8}
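
One way to answer the question above is to compare the mean first-pass count before and after steering on the same tasks. The sketch below is illustrative only: `mean_samples_to_pass` is a hypothetical helper, the baseline counts are copied from three tasks in the list above, and the steered counts are made up rather than measured.

```python
from typing import Dict, Optional

MAX_SAMPLES = 10  # rollout budget per task, matching the pass@10 run


def mean_samples_to_pass(
    counts: Dict[str, Optional[int]], fill_value: int = MAX_SAMPLES
) -> float:
    """Average first-pass count; tasks with no pass count as fill_value (a lower bound)."""
    filled = [fill_value if n is None else n for n in counts.values()]
    return sum(filled) / len(filled)


# Baseline counts copied from three of the tasks listed above.
baseline = {"HumanEval/5": 6, "HumanEval/26": 5, "HumanEval/33": 10}
# Made-up steered counts, for illustration only (not a measured result).
steered = {"HumanEval/5": 2, "HumanEval/26": 5, "HumanEval/33": None}

delta = mean_samples_to_pass(steered) - mean_samples_to_pass(baseline)
print(f"change in mean samples to first pass: {delta:+.2f}")
```

A task that stops passing under steering is counted at the budget here, which understates how much worse it became; that caveat matters when comparing features.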

In [61]:
xs = np.array(list(n_samples_by_task_id.values()))
for x in [1, 2, None]:
    print(x, xs[xs == x].shape)

1 (102,)
2 (25,)
None (15,)


### Computational savings
The base model needed 111 rollouts in total to solve these 22 tasks.

Computing pass@10 took 1640 rollouts (164 tasks × 10 samples each).

If we evaluate only these 22 tasks and stop each one at its first pass, we get the signal we're looking for with about 6.8% of the compute. That's roughly 15x faster, or ~30 seconds instead of 8 minutes on an L40S.
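
The arithmetic behind these figures can be reproduced as below. The task count of 164 is inferred from the counts printed above (102 + 25 + 15 + 22), and the time estimate assumes that runtime scales linearly with the number of rollouts, which is only an approximation.

```python
n_tasks = 164              # 102 + 25 + 15 + 22 tasks in the results above
max_samples = 10           # rollouts per task for pass@10
first_pass_rollouts = 111  # rollouts the 22 mid-difficulty tasks needed
pass_at_k_minutes = 8      # quoted pass@10 runtime on an L40S

pass_at_k_rollouts = n_tasks * max_samples
fraction = first_pass_rollouts / pass_at_k_rollouts
speedup = pass_at_k_rollouts / first_pass_rollouts
estimated_seconds = pass_at_k_minutes * 60 * fraction  # assumes linear scaling

print(f"{pass_at_k_rollouts} rollouts for pass@10")
print(f"{fraction:.1%} of the compute, {speedup:.1f}x fewer rollouts")
print(f"~{estimated_seconds:.0f} seconds instead of {pass_at_k_minutes} minutes")
```

The 111-rollout figure is the unsteered baseline; a steered model may need more or fewer rollouts on the same 22 tasks, so the real cost per feature will vary.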


## Further optimization: fit a Gaussian to n_samples, stop testing additional tasks once we're confident enough

This could provide another order of magnitude (OOM) of sample efficiency.

Probably not worth the additional complexity, but I didn't realize that until after I'd gotten this working... maybe I'll come back to this later.
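
The code that follows only prints the running fit and never stops early. A stopping rule might look like the sketch below, which swaps the Gaussian fit for a plain mean with a normal-approximation confidence interval. `estimate_with_early_stop` and its `max_half_width` tolerance are hypothetical, and the rule assumes tasks are visited in random order.

```python
from typing import Iterable, List, Optional, Tuple

import numpy as np


def estimate_with_early_stop(
    first_pass_counts: Iterable[Optional[int]],
    fill_value: int = 10,
    min_tasks: int = 10,
    max_half_width: float = 0.5,
) -> Tuple[float, float, int]:
    """Consume per-task counts until the ~95% CI on the mean is narrow enough.

    Returns (mean, half_width, n_tasks_used).
    """
    seen: List[float] = []
    mean, half_width = float("nan"), float("inf")
    for count in first_pass_counts:
        # Failed tasks (None) are counted at the budget, a lower bound.
        seen.append(float(fill_value if count is None else count))
        if len(seen) < min_tasks:
            continue
        mean = float(np.mean(seen))
        # Normal approximation: half-width = 1.96 * sample std / sqrt(n).
        half_width = float(1.96 * np.std(seen, ddof=1) / np.sqrt(len(seen)))
        if half_width <= max_half_width:
            break  # confident enough: skip the remaining tasks
    return mean, half_width, len(seen)


# First 20 per-task counts from the notebook output above.
first_20 = [6, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1]
mean, half_width, n_used = estimate_with_early_stop(first_20, max_half_width=0.6)
print(f"mean {mean:.2f} +/- {half_width:.2f} after {n_used} tasks")
```

The counts are skewed and capped at the rollout budget, so the normal approximation is rough; the tolerance should be treated as a tuning knob rather than a guarantee.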

In [43]:
import numpy as np
from typing import List, Optional, Tuple
from sklearn.mixture import GaussianMixture

MAX_SAMPLES = 10

def fit_gaussian(xs: List[Optional[int]], fill_value: int = MAX_SAMPLES) -> Tuple[float, float]:
    """Fit a single Gaussian to the sample counts and return (mean, stddev)."""
    # Failed tasks (None) are replaced with fill_value.
    # But really, the true count could be much larger than fill_value. How to handle this?
    data = np.array([fill_value if x is None else x for x in xs], dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=1, covariance_type='spherical')
    gmm.fit(data)
    mean = gmm.means_[0][0]
    var = gmm.covariances_[0]  # 'spherical' stores one variance per component
    return mean.item(), np.sqrt(var).item()

mean, stddev = fit_gaussian([10, 16, 8, 5, 11, 15, 15, 15, 15, 13, 10, 15, 19, 5, 2, 60, 50])
print(f"mean: {mean:.2f}, std: {stddev:.2f}")

mean: 16.71, std: 14.76

In [44]:
fails = 0
n_samples_by_task_id = {}
rollouts_averted = 0

MIN_TASKS = 10

for i, (task_id, samples) in enumerate(results['eval'].items()):
    # iterate through our samples, looking for the first one to pass
    for n, attempt in enumerate(samples):
        n += 1  # make n_samples 1-indexed
        if attempt['base_status'] == 'pass':
            n_samples_by_task_id[task_id] = n
            rollouts_averted += (MAX_SAMPLES - n)
            break
    else:  # no successful samples
        fails += 1
        assert n == MAX_SAMPLES
        n_samples_by_task_id[task_id] = None

    if len(n_samples_by_task_id) > MIN_TASKS:
        mean, stddev = fit_gaussian(list(n_samples_by_task_id.values()))
        if i % 5 == 0:
            print(f"mean samples for pass: {mean:.2f} (std: {stddev:.2f})")

n_tasks = len(results['eval'])
n_runs = n_tasks * MAX_SAMPLES  # assumes pass@k, which generates MAX_SAMPLES for each problem
print(f"pass@10: {1-fails/n_tasks:.3f}, rollouts averted: {rollouts_averted} ({rollouts_averted/n_runs*100:.1f}%)")

mean samples for pass: 1.73 (std: 1.42)
mean samples for pass: 1.56 (std: 1.22)
mean samples for pass: 1.52 (std: 1.10)
mean samples for pass: 1.58 (std: 1.21)
mean samples for pass: 1.48 (std: 1.13)
mean samples for pass: 1.94 (std: 2.22)
mean samples for pass: 1.85 (std: 2.10)
mean samples for pass: 1.76 (std: 2.00)
mean samples for pass: 1.69 (std: 1.91)
mean samples for pass: 1.62 (std: 1.84)
mean samples for pass: 1.57 (std: 1.77)
mean samples for pass: 1.55 (std: 1.71)
mean samples for pass: 1.55 (std: 1.66)
mean samples for pass: 1.64 (std: 1.88)
mean samples for pass: 1.67 (std: 1.83)
mean samples for pass: 1.69 (std: 1.82)
mean samples for pass: 1.77 (std: 1.84)
mean samples for pass: 1.82 (std: 1.98)
mean samples for pass: 1.82 (std: 1.95)
mean samples for pass: 1.91 (std: 2.09)
mean samples for pass: 2.00 (std: 2.19)
mean samples for pass: 2.04 (std: 2.27)
mean samples for pass: 2.20 (std: 2.49)
mean samples for pass: 2.17 (std: 2.45)
mean samples for pass: 2.37 (std: 2.68)
