DunLi-Tsinghua/MetaAI-Mini

MetaAI-Mini Supplementary Experiment

MetaAI-Mini is a tiny, fully reproducible teaching experiment for recursive self-design. It uses the first 10 official records from the OpenAI HumanEval benchmark and evaluates a self-improving Python coding agent over five generations.

This package is intentionally small. It is not a replacement for DGM, SWE-bench, or repository-level coding-agent evaluation. Its purpose is to make the mechanics of "human zero-to-one, AI one-to-N" inspectable on a laptop.

Files

README.md
requirements.txt
seed_agent.py
self_improve.py
analyze.py
data/
  HumanEval.jsonl.gz
  source_metadata.json
  tasks.json
figures/
  dgm_published_results.pdf
  metai_mini_protocol.pdf
results/
  .gitkeep

Data Provenance

data/tasks.json contains the first 10 official records from the OpenAI HumanEval data file. It contains neither 50 tasks nor the full 164-task HumanEval benchmark. The source file is:

https://raw.githubusercontent.com/openai/human-eval/master/data/HumanEval.jsonl.gz

The task records were mechanically extracted without rewriting the prompts or tests. The metadata file records SHA-256 checksums and the UTC generation timestamp:

data/source_metadata.json
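As a rough illustration of the mechanical extraction and checksum steps described above, the following sketch reads the first records from a gzipped JSON Lines file and hashes a file with SHA-256. The function names first_records and sha256_of are hypothetical and are not taken from this repository; the actual extraction script and the layout of source_metadata.json may differ.

```python
import gzip
import hashlib
import json
from pathlib import Path


def first_records(gz_path, n=10):
    """Return the first n JSON records from a gzipped JSON Lines file."""
    records = []
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            records.append(json.loads(line))  # parsed as-is, no rewriting
            if len(records) == n:
                break
    return records


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()
```

Assuming the file layout listed under Files, first_records("data/HumanEval.jsonl.gz") would return ten records, and sha256_of("data/tasks.json") could be compared against whatever digest the metadata file records.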

Each task includes:

task_id, prompt, entry_point, test
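To show how these four fields fit together, the sketch below assembles a runnable check program the way the official HumanEval harness does: prompt, then the model's completion, then the test code, then a call to check on the entry point. Whether MetaAI-Mini's agents return a function body or a whole function is not stated here, so treat the concatenation as an assumption. The toy record is invented for illustration and is not a benchmark task.

```python
def build_test_program(task, completion):
    """Concatenate prompt, completion, tests, and the final check call."""
    return (
        task["prompt"]
        + completion
        + "\n\n"
        + task["test"]
        + "\n\n"
        + f"check({task['entry_point']})\n"
    )


# Toy record in the HumanEval shape (hypothetical, not a real task).
toy_task = {
    "task_id": "Toy/0",
    "prompt": 'def add(a, b):\n    """Return a + b."""\n',
    "entry_point": "add",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
}

program = build_test_program(toy_task, "    return a + b\n")
exec(program, {})  # raises AssertionError if the completion is wrong
```

Running the assembled program in-process with exec is only for demonstration; generated code is better run in a separate process, as the Safety section explains.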

Setup

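Use Python 3.10 or newer. The commands below use Windows Command Prompt syntax; on macOS or Linux, activate the environment with source .venv/bin/activate instead.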

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Set your OpenAI key and, optionally, a model. In Windows Command Prompt (on macOS or Linux, use export instead of set):

set OPENAI_API_KEY=your_key_here
set OPENAI_MODEL=gpt-4.1-mini

On PowerShell:

$env:OPENAI_API_KEY="your_key_here"
$env:OPENAI_MODEL="gpt-4.1-mini"
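For readers who want to see how a script might pick up these two variables, here is a minimal sketch. The helper name read_openai_settings and the fallback model are assumptions made for illustration; the scripts in this repository may read the environment differently or use another default.

```python
import os


def read_openai_settings(env=None, default_model="gpt-4.1-mini"):
    """Return (api_key, model) from environment-style variables.

    The default model is an assumption for this sketch only.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # OPENAI_MODEL is optional, so fall back to the default when absent.
    return api_key, env.get("OPENAI_MODEL", default_model)
```

Passing a plain dictionary as env makes the helper easy to exercise without touching the real environment.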

Run

python self_improve.py --generations 5 --reset-results
python analyze.py

Outputs are written to:

results/scores.csv
results/agent_gen0.py
results/agent_gen*_candidate.py
results/agent_gen*.py
results/improvement_curve.pdf
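The candidate and non-candidate file names above suggest a propose-then-accept loop. This README does not spell out the acceptance rule, so the following is only one plausible reading: a candidate replaces the current agent when it scores strictly higher. The callables propose and evaluate are hypothetical stand-ins for the model call and the task evaluation, and the generation numbering is likewise assumed.

```python
def self_improve(seed_agent, propose, evaluate, generations=5):
    """Greedy generational loop (assumed rule: keep strictly better candidates).

    Returns the final agent and a history of (generation, score, accepted).
    """
    current = seed_agent
    current_score = evaluate(current)
    history = [(0, current_score, True)]  # generation 0 is the seed
    for gen in range(1, generations + 1):
        candidate = propose(current, current_score)
        candidate_score = evaluate(candidate)
        accepted = candidate_score > current_score
        if accepted:
            current, current_score = candidate, candidate_score
        history.append((gen, candidate_score, accepted))
    return current, history


# Toy demonstration: "agents" are integers and the score saturates at 3.
final_agent, history = self_improve(
    seed_agent=0,
    propose=lambda agent, score: agent + 1,
    evaluate=lambda agent: min(agent, 3),
)
```

In the toy run the score stops rising once it saturates, so later candidates are rejected. This mirrors the caveat under Expected Behavior that improvement is not guaranteed.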

results/scores.csv includes a run_type column. Rows from real OpenAI API-backed runs use run_type=api. Rows from smoke tests use run_type=mock.
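A quick way to check which kinds of rows a scores file holds is to tally the run_type column. In the sketch below, only run_type comes from this README; the other column names in the sample data are placeholders, and count_run_types is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def count_run_types(csv_file):
    """Count rows per run_type value in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return Counter(row["run_type"] for row in reader)


# In-memory sample; "generation" and "score" are placeholder columns.
sample = io.StringIO(
    "generation,score,run_type\n"
    "0,0.1,mock\n"
    "0,0.2,api\n"
    "1,0.5,api\n"
)
counts = count_run_types(sample)
```

For a real file, the same function can be called on open("results/scores.csv", newline="", encoding="utf-8").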

Manuscript Figures

The paper inserts pre-generated PDF figures rather than drawing experimental figures directly in LaTeX. Regenerate the manuscript figures with:

python analyze.py --paper-figures --skip-curve

This writes:

figures/dgm_published_results.pdf
figures/metai_mini_protocol.pdf

figures/dgm_published_results.pdf visualizes published DGM endpoint and ablation values that are hard-coded in analyze.py with source comments. It is not a new DGM run. figures/metai_mini_protocol.pdf is a procedural schematic, not a performance measurement.

Smoke Test Without an API Key

The following checks the local evaluation pipeline but does not perform real self-improvement:

python self_improve.py --generations 1 --mock --reset-results
python analyze.py --allow-mock

Mock runs are only smoke tests for local plumbing. They do not call the OpenAI API, do not perform real self-improvement, and must not be reported in the paper. By default, python analyze.py refuses to generate the formal results/improvement_curve.pdf from mock-only rows; --allow-mock is provided only for smoke-test plotting.
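The refusal described above can be pictured as a small guard in front of the plotting code. This is a sketch under assumptions, not the logic in analyze.py: the function name select_rows_for_plot is hypothetical, and the choice to plot only api rows when both kinds are present is a guess about mixed files.

```python
def select_rows_for_plot(rows, allow_mock=False):
    """Pick the rows to plot, refusing mock-only data unless allowed."""
    api_rows = [r for r in rows if r.get("run_type") == "api"]
    if api_rows:
        return api_rows  # assumption: mock rows are dropped when api rows exist
    mock_rows = [r for r in rows if r.get("run_type") == "mock"]
    if mock_rows and allow_mock:
        return mock_rows  # smoke-test plotting only
    raise ValueError("no run_type=api rows; refusing to plot formal results")
```

Raising an error instead of silently plotting keeps mock numbers from reaching a formal figure by accident.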

Expected Behavior

The seed agent is intentionally weak and usually solves few or no tasks. With a capable model, later generations may solve more of the 10 tasks, but improvement is not guaranteed because outputs depend on model behavior, API version, and stochastic decoding. Treat results/scores.csv from your own API-backed run as the only experimental result. Do not report MetaAI-Mini performance in the paper unless results/scores.csv contains real run_type=api rows.

Safety

This package executes model-generated Python code locally. It uses subprocesses and timeouts, but it is not a secure sandbox. Because the generated code is arbitrary, run the package in an isolated environment such as a container or a disposable virtual machine.
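For illustration, a subprocess-with-timeout runner might look like the sketch below. The name run_with_timeout and the five-second default are assumptions, and the repository's own runner may differ. The timeout only bounds wall-clock time: it does not restrict file, network, or memory access, which is why it is not a sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_timeout(source, timeout_s=5.0):
    """Run Python source in a child interpreter; return True on exit code 0."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,           # keep stray files in the temp directory
                capture_output=True,   # do not echo child output
                text=True,
                timeout=timeout_s,     # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0
```

A failing assertion or an infinite loop in the child both yield False, so a harness built this way can score a task as unsolved without crashing itself.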
