Experiments on whether LLM copilots improve efficiency or mainly expand coverage. We run multi-attempt code generation on MBPP+ using DeepSeek-Coder-6.7B and evaluate with EvalPlus base/augmented tests.
- Multi-attempt sampling (5 attempts per task) doubled base-test correctness on MBPP (pass@1 0.35 → pass@5 0.70); on the augmented "plus" tests it rose from 0.25 to 0.55 (the pass@k estimator is sketched below).
- About half of the successful tasks passed only on an attempt after the first, showing that coverage, not first-try speed, drives the gains.
- Latency to first pass averaged ~5 s versus ~3.3 s for a single attempt, so the efficiency gain is modest while the coverage gain is large.
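For context, pass@k here refers to the standard unbiased estimator from the Codex evaluation setup (a sketch of the usual formula, not necessarily the repo's exact code); with n = k = 5 it reduces to "at least one of the five attempts passed":

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations, c of which are correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 5 attempts for one task, 2 of them correct.
print(pass_at_k(n=5, c=2, k=1))  # 0.4
print(pass_at_k(n=5, c=2, k=5))  # 1.0
```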
- Activate env: `source .venv/bin/activate`
- Run experiment (MBPP by default): `python notebooks/coverage_vs_efficiency.py`
- Config variables (`dataset`, `task_count`, `attempts`, `temperature`) are at the top of `notebooks/coverage_vs_efficiency.py` (see the sketch after this list).
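The configuration block at the top of the script is along these lines; the values shown are illustrative, not necessarily the repo's defaults:

```python
# Configuration knobs read by the rest of the script.
# Values are illustrative defaults, not necessarily what the repo ships with.
dataset = "mbpp"      # which benchmark to run: "mbpp" or "humaneval"
task_count = 100      # number of tasks to evaluate (assumed default)
attempts = 5          # generation attempts (samples) per task
temperature = 0.8     # sampling temperature (assumed default)
```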
- Outputs (a quick inspection snippet follows this list):
  - Metrics: `results/metrics.json`
  - Samples: `results/mbpp_samples.jsonl`
  - Eval records: `results/mbpp_eval.json`
  - Plots: `results/plots/mbpp_pass_rates.png`, `results/plots/mbpp_success_attempts.png`
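A quick way to look at the results after a run; the field names inside `metrics.json` are an assumption about the schema, so inspect the file to confirm:

```python
import json

# Aggregate metrics (schema assumed; print it to see the actual fields).
with open("results/metrics.json") as f:
    metrics = json.load(f)
print(json.dumps(metrics, indent=2))

# Per-task generations, one JSON record per line.
with open("results/mbpp_samples.jsonl") as f:
    samples = [json.loads(line) for line in f]
print(f"{len(samples)} sample records")
```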
- `planning.md` - research plan.
- `notebooks/coverage_vs_efficiency.py` - generation + evaluation script.
- `results/` - metrics, eval logs, plots, analysis summaries.
- `datasets/` - pre-downloaded MBPP/HumanEval data.
- `code/` - cloned evalplus/human-eval baselines (not used directly; evalplus is installed as a package; a data-loading sketch follows).
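Since evalplus is installed, the MBPP+ problems can be loaded through its data API; this is the usual pattern, though the script itself may read from `datasets/` instead:

```python
from evalplus.data import get_mbpp_plus

problems = get_mbpp_plus()                 # dict: task_id -> problem record
task_id, problem = next(iter(problems.items()))
print(task_id)
print(problem["prompt"][:200])             # each record includes the task prompt
```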
- Torch GPU build (`torch==2.5.1+cu124`) is installed; the script automatically uses the GPU when one is available (device-selection sketch below).
- To switch to HumanEval, set `dataset = "humaneval"` in the script; adjust `task_count` / `temperature` as desired.