MetaAI-Mini is a tiny, fully reproducible teaching experiment for recursive self-design. It uses the first 10 official records from the OpenAI HumanEval benchmark and evaluates a self-improving Python coding agent over five generations.
This package is intentionally small. It is not a replacement for DGM, SWE-bench, or repository-level coding-agent evaluation. Its purpose is to make the mechanics of "human zero-to-one, AI one-to-N" inspectable on a laptop.
README.md
requirements.txt
seed_agent.py
self_improve.py
analyze.py
data/
  HumanEval.jsonl.gz
  source_metadata.json
  tasks.json
figures/
  dgm_published_results.pdf
  metai_mini_protocol.pdf
results/
  .gitkeep
data/tasks.json contains the first 10 official records from the OpenAI HumanEval data file. It contains neither a 50-task subset nor the full 164-task HumanEval benchmark. The source file is:
https://raw.githubusercontent.com/openai/human-eval/master/data/HumanEval.jsonl.gz
The task records were mechanically extracted without rewriting the prompts or tests. The metadata file records SHA-256 checksums and the UTC generation timestamp:
data/source_metadata.json
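The extraction and checksum steps described above can be sketched as follows. This is a minimal illustration, not the package's actual extraction script: the function names first_records and sha256_of_file are hypothetical, and the sketch makes no assumption about the key names used inside data/source_metadata.json.

```python
import gzip
import hashlib
import json


def first_records(gz_path, count=10):
    """Return the first `count` JSON records from a gzipped JSONL file.

    Records are parsed but not rewritten (hypothetical helper).
    """
    records = []
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if len(records) >= count:
                break
            if line.strip():  # skip blank lines, if any
                records.append(json.loads(line))
    return records


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of_file("data/HumanEval.jsonl.gz") and comparing the result by eye with the value stored in data/source_metadata.json is one way to confirm that the local copy matches the recorded source.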
Each task includes:
task_id, prompt, entry_point, test
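A quick structural check of the task file might look like the sketch below. It assumes that data/tasks.json is a JSON array of objects, which the text above does not state explicitly; the helper names missing_fields and load_tasks are hypothetical.

```python
import json

# The four fields listed above.
REQUIRED_FIELDS = ("task_id", "prompt", "entry_point", "test")


def missing_fields(record):
    """Return the required field names that are absent from one task record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def load_tasks(path="data/tasks.json"):
    """Load task records and fail loudly if any record lacks a required field.

    Assumes the file holds a JSON array of objects.
    """
    with open(path, encoding="utf-8") as handle:
        tasks = json.load(handle)
    for index, record in enumerate(tasks):
        missing = missing_fields(record)
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
    return tasks
```

If the assumption about the file layout holds, len(load_tasks()) should be 10.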
Use Python 3.10 or newer.
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
In Command Prompt, set your OpenAI key and, optionally, a model:
set OPENAI_API_KEY=your_key_here
set OPENAI_MODEL=gpt-4.1-mini
On PowerShell:
$env:OPENAI_API_KEY="your_key_here"
$env:OPENAI_MODEL="gpt-4.1-mini"
Run the five-generation experiment, then the analysis:
python self_improve.py --generations 5 --reset-results
python analyze.py
Outputs are written to:
results/scores.csv
results/agent_gen0.py
results/agent_gen*_candidate.py
results/agent_gen*.py
results/improvement_curve.pdf
results/scores.csv includes a run_type column. Rows from real OpenAI API-backed runs use run_type=api. Rows from smoke tests use run_type=mock.
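To separate real runs from smoke tests when inspecting the scores by hand, a filter on run_type is enough. The sketch below is not taken from analyze.py; the helper name api_rows is hypothetical, and only the run_type column is assumed to exist, since the other columns of results/scores.csv are not described here.

```python
import csv


def api_rows(csv_path="results/scores.csv"):
    """Return only the rows whose run_type is "api" (hypothetical helper)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [row for row in reader if row.get("run_type") == "api"]
```

An empty list means the file holds no API-backed rows, in which case there is nothing to report.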
The paper inserts pre-generated PDF figures rather than drawing experimental figures directly in LaTeX. Regenerate the manuscript figures with:
python analyze.py --paper-figures --skip-curve
This writes:
figures/dgm_published_results.pdf
figures/metai_mini_protocol.pdf
figures/dgm_published_results.pdf visualizes published DGM endpoint and ablation values that are hard-coded in analyze.py with source comments. It is not a new DGM run. figures/metai_mini_protocol.pdf is a procedural schematic, not a performance measurement.
The following checks the local evaluation pipeline but does not perform real self-improvement:
python self_improve.py --generations 1 --mock --reset-results
python analyze.py --allow-mock
Mock runs are only smoke tests for local plumbing. They do not call the OpenAI API, do not perform real self-improvement, and must not be reported in the paper. By default, python analyze.py refuses to generate the formal results/improvement_curve.pdf from mock-only rows; --allow-mock is provided only for smoke-test plotting.
The seed agent is intentionally weak and usually solves few or no tasks. With a capable model, later generations may solve more of the 10 tasks, but improvement is not guaranteed because outputs depend on model behavior, API version, and stochastic decoding. Treat results/scores.csv from your own API-backed run as the only experimental result. Do not report MetaAI-Mini performance in the paper unless results/scores.csv contains real run_type=api rows.
This package executes generated Python code locally. It uses subprocesses and timeouts, but it is not a secure sandbox. Run it in an isolated environment if you allow arbitrary model-generated code.
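The general pattern of running code in a subprocess with a timeout can be sketched as follows. This is an illustration of the technique only, not the package's implementation; the function name run_with_timeout and the 5-second default are assumptions. As noted above, a subprocess plus a timeout limits runaway execution but does not isolate the file system or network.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_seconds=5.0):
    """Run a Python snippet in a child interpreter and report (passed, output).

    Hypothetical helper: a timeout guards against hangs, nothing more.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return False, "timeout"
    return completed.returncode == 0, completed.stdout + completed.stderr
```

A non-zero exit code, for example from a failed assertion in a test harness, is reported as a failure together with the captured output.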