Artificial Intelligence research: LLMs increase coverage, not efficiency | Generated by Idea Explorer on 2025-11-23

ChicagoHAI/llm-coverage-efficiency-codex

Overview

Experiments on whether LLM copilots improve efficiency or mainly expand coverage. We run multi-attempt code generation on MBPP+ using DeepSeek-Coder-6.7B and evaluate with EvalPlus base/augmented tests.

Key Findings

  • Multi-attempt sampling (5 shots) doubled base-test correctness on MBPP (pass@1 0.35 → pass@5 0.70); the stricter EvalPlus "plus" tests rose from 0.25 to 0.55.
  • Half of the successful tasks passed only on a retry (attempt 2 or later), showing that coverage, not first-try speed, drives the gains.
  • Mean latency to first pass was ~5 s, versus ~3.3 s for a single attempt, so efficiency gains are modest while coverage leaps.
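For reference, pass@k figures like those above are conventionally computed with the unbiased estimator from the Codex paper. A minimal sketch (illustrative, not the repo's actual code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn from n generations is correct, given c of n passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. with 5 generations per task, 2 of which passed:
print(pass_at_k(5, 2, 1))  # ~0.4
print(pass_at_k(5, 2, 5))  # 1.0
```

Averaging this quantity over tasks gives the per-benchmark pass@1 and pass@5 numbers reported above.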

How to Reproduce

  1. Activate env: source .venv/bin/activate
  2. Run experiment (MBPP by default): python notebooks/coverage_vs_efficiency.py
    • Config variables (dataset, task_count, attempts, temperature) are at the top of notebooks/coverage_vs_efficiency.py.
  3. Outputs:
    • Metrics: results/metrics.json
    • Samples: results/mbpp_samples.jsonl
    • Eval records: results/mbpp_eval.json
    • Plots: results/plots/mbpp_pass_rates.png, results/plots/mbpp_success_attempts.png
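The "latency to first pass" and attempts-to-success metrics can be recomputed from per-task attempt records. The sketch below assumes a hypothetical record schema; the actual layout of results/mbpp_eval.json may differ:

```python
def first_pass_stats(records):
    """Summarise multi-attempt results.

    `records` is a list of per-task dicts shaped like
    {"attempts": [{"passed": bool, "latency_s": float}, ...]}
    (an assumed schema, not necessarily the repo's).
    Returns (solve rate, mean attempts to first pass,
    mean cumulative latency to first pass).
    """
    solved, first_attempts, first_latencies = 0, [], []
    for rec in records:
        elapsed = 0.0
        for i, attempt in enumerate(rec["attempts"], start=1):
            elapsed += attempt["latency_s"]
            if attempt["passed"]:
                solved += 1
                first_attempts.append(i)
                first_latencies.append(elapsed)
                break

    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    rate = solved / len(records) if records else 0.0
    return rate, mean(first_attempts), mean(first_latencies)
```

Counting attempts until the first pass (rather than only pass@k) is what separates a coverage effect (later attempts succeed) from an efficiency effect (the first attempt gets faster).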

File Structure

  • planning.md — research plan.
  • notebooks/coverage_vs_efficiency.py — generation + evaluation script.
  • results/ — metrics, eval logs, plots, analysis summaries.
  • datasets/ — pre-downloaded MBPP/HumanEval data.
  • code/ — cloned evalplus/human-eval baselines (not used directly; evaluation runs via the installed evalplus package).

Notes

  • The Torch GPU build (torch==2.5.1+cu124) is installed; the script automatically uses the GPU when one is available.
  • To switch to HumanEval, set dataset = "humaneval" in the script and adjust task_count/temperature as desired.
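Per the notes above, switching datasets is a one-line config change. The top of notebooks/coverage_vs_efficiency.py presumably contains a block like the following (variable names are from this README; the default values shown are illustrative guesses):

```python
# Config block: variable names from the README; values are illustrative guesses.
dataset = "humaneval"   # "mbpp" is the default
task_count = 100        # number of tasks to evaluate
attempts = 5            # generations per task (enables pass@5)
temperature = 0.8       # sampling temperature for multi-attempt generation
```

Note that a temperature of 0 would make all attempts identical, so multi-attempt sampling only adds coverage with a nonzero temperature.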
