MetaAI-Mini Supplementary Experiment

MetaAI-Mini is a tiny, fully reproducible teaching experiment for recursive self-design. It uses the first 10 official records from the OpenAI HumanEval benchmark and evaluates a self-improving Python coding agent over five generations.

This package is intentionally small. It is not a replacement for DGM, SWE-bench, or repository-level coding-agent evaluation. Its purpose is to make the mechanics of "human zero-to-one, AI one-to-N" inspectable on a laptop.

Files

README.md
requirements.txt
seed_agent.py
self_improve.py
analyze.py
data/
  HumanEval.jsonl.gz
  source_metadata.json
  tasks.json
figures/
  dgm_published_results.pdf
  metai_mini_protocol.pdf
results/
  .gitkeep

Data Provenance

data/tasks.json contains the first 10 official records from the OpenAI HumanEval data file. It does not contain 50 tasks, and it does not contain the full 164-task HumanEval benchmark. The records come from:

https://raw.githubusercontent.com/openai/human-eval/master/data/HumanEval.jsonl.gz

The task records were mechanically extracted without rewriting the prompts or tests. The metadata file records SHA-256 checksums and the UTC generation timestamp:

data/source_metadata.json
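
To check a data file against a recorded checksum, a helper along the following lines may be useful. This is a minimal sketch, not code from the package: the function name sha256_of_file is hypothetical, and the key names inside data/source_metadata.json are not shown in this README, so the comparison against the recorded value is left to the reader.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    Hypothetical helper: the file is read in chunks so that large
    archives do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, sha256_of_file("data/HumanEval.jsonl.gz") can be compared by eye with the corresponding entry in data/source_metadata.json.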

Each task includes:

task_id, prompt, entry_point, test
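
The four fields fit together as follows in the usual HumanEval convention: the prompt holds the function signature and docstring, the model supplies the body, the test field defines a check(candidate) function, and the entry point names the function passed to it. The sketch below illustrates that convention on a made-up toy record. Whether self_improve.py assembles programs in exactly this way is an assumption, and build_check_program and toy_task are hypothetical names.

```python
def build_check_program(task, completion):
    """Join prompt, completion, test code, and the final check call."""
    return (
        task["prompt"]
        + completion
        + "\n\n"
        + task["test"]
        + "\n\n"
        + f"check({task['entry_point']})\n"
    )


# Hypothetical toy record in the HumanEval shape (not one of the 10 tasks).
toy_task = {
    "task_id": "Toy/0",
    "prompt": 'def add(a, b):\n    """Return a + b."""\n',
    "entry_point": "add",
    "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
}

program = build_check_program(toy_task, "    return a + b\n")
namespace = {}
# Raises AssertionError if the completion fails the task's own test.
exec(program, namespace)
```

The toy program is executed in-process only because its contents are known. Model-generated completions should go through a subprocess with a timeout, as noted under Safety.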

Setup

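Use Python 3.10 or newer. The commands below are written for Windows; on macOS or Linux, activate the environment with source .venv/bin/activate and set variables with export instead of set.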

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

In Command Prompt, set your OpenAI API key and, optionally, a model:

set OPENAI_API_KEY=your_key_here
set OPENAI_MODEL=gpt-4.1-mini

On PowerShell:

$env:OPENAI_API_KEY="your_key_here"
$env:OPENAI_MODEL="gpt-4.1-mini"
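
A script can pick these variables up roughly as shown below. This is a sketch under stated assumptions: read_openai_settings is a hypothetical name, and how self_improve.py actually reads its configuration, including which model it falls back to when OPENAI_MODEL is unset, is not shown in this README.

```python
import os


def read_openai_settings(environ=os.environ):
    """Return (api_key, model) taken from environment variables.

    Hypothetical helper. `model` is None when OPENAI_MODEL is unset,
    which leaves the choice of a default to the caller.
    """
    api_key = environ.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    model = environ.get("OPENAI_MODEL")
    return api_key, model
```

Passing the mapping as a parameter keeps the helper easy to test with a plain dictionary in place of the real environment.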

Run

python self_improve.py --generations 5 --reset-results
python analyze.py

Outputs are written to:

results/scores.csv
results/agent_gen0.py
results/agent_gen*_candidate.py
results/agent_gen*.py
results/improvement_curve.pdf

results/scores.csv includes a run_type column. Rows from real OpenAI API-backed runs use run_type=api. Rows from smoke tests use run_type=mock.
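
Before reporting anything, it can help to count how many rows of each type a scores file holds. The sketch below relies only on the run_type column named above. The generation column in the sample data is a made-up placeholder, and count_run_types is a hypothetical helper rather than part of analyze.py.

```python
import csv
import io
from collections import Counter


def count_run_types(csv_file):
    """Count rows per run_type value in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["run_type"] for row in reader)


# Illustrative in-memory data; the real file may have other columns.
sample = io.StringIO("generation,run_type\n0,mock\n0,api\n1,api\n")
counts = count_run_types(sample)
```

With a real file, the same function can be called as count_run_types(open("results/scores.csv", newline="")).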

Manuscript Figures

The paper inserts pre-generated PDF figures rather than drawing experimental figures directly in LaTeX. Regenerate the manuscript figures with:

python analyze.py --paper-figures --skip-curve

This writes:

figures/dgm_published_results.pdf
figures/metai_mini_protocol.pdf

figures/dgm_published_results.pdf visualizes published DGM endpoint and ablation values that are hard-coded in analyze.py with source comments. It is not a new DGM run. figures/metai_mini_protocol.pdf is a procedural schematic, not a performance measurement.

Smoke Test Without an API Key

The following checks the local evaluation pipeline but does not perform real self-improvement:

python self_improve.py --generations 1 --mock --reset-results
python analyze.py --allow-mock

Mock runs are only smoke tests for local plumbing. They do not call the OpenAI API, do not perform real self-improvement, and must not be reported in the paper. By default, python analyze.py refuses to generate the formal results/improvement_curve.pdf from mock-only rows; --allow-mock is provided only for smoke-test plotting.
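
The refusal rule can be pictured as a small guard like the one below. This is an illustration of the stated behavior, not the code in analyze.py: rows_for_curve is a hypothetical name, and the choice to prefer api rows whenever any exist is an assumption about how mixed files are handled.

```python
def rows_for_curve(rows, allow_mock=False):
    """Select the score rows that may be plotted.

    Assumption: api rows win whenever present; mock rows are used
    only when no api rows exist and allow_mock is True.
    """
    api_rows = [row for row in rows if row.get("run_type") == "api"]
    if api_rows:
        return api_rows
    if allow_mock:
        return [row for row in rows if row.get("run_type") == "mock"]
    raise ValueError("no run_type=api rows; pass allow_mock for smoke-test plots")
```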

Expected Behavior

The seed agent is intentionally weak and usually solves few or no tasks. With a capable model, later generations may solve more of the 10 tasks, but improvement is not guaranteed because outputs depend on model behavior, API version, and stochastic decoding. Treat results/scores.csv from your own API-backed run as the only experimental result. Do not report MetaAI-Mini performance in the paper unless results/scores.csv contains real run_type=api rows.

Safety

This package executes generated Python code locally. It uses subprocesses and timeouts, but it is not a secure sandbox. Because model-generated code is arbitrary, run the package in an isolated environment such as a container or a disposable virtual machine.
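
The subprocess-and-timeout pattern looks roughly like the sketch below. It shows the general technique only: run_with_timeout is a hypothetical name, the 5-second default is an arbitrary choice, and the package's own runner may differ. A timeout bounds running time but does nothing to restrict file or network access.

```python
import subprocess
import sys


def run_with_timeout(source, timeout_seconds=5.0):
    """Run `source` in a fresh Python interpreter.

    Returns (passed, detail): detail is stdout on success, stderr on
    a non-zero exit, or the string "timeout" if the limit was hit.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        return False, "timeout"
    if completed.returncode != 0:
        return False, completed.stderr.strip()
    return True, completed.stdout.strip()
```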
