OpenRepro-Agent is a Python CLI workflow for paper reproduction projects. It initializes a reproducible workspace, ingests Markdown/txt/PDF sources, extracts candidate formulas and parameters, plans experiments, scaffolds human-gated experiment code, runs lightweight demos and parameter sweeps, validates generated artifacts, inspects project state, runs workflow-compliance benchmarks and suites, indexes benchmark evidence, classifies failures, tracks cache-aware provider usage, and produces multi-agent handoff files and evidence packages.
Current version: v1.2.0. This is still an alpha engineering scaffold, not a finished autonomous paper-reproduction system.
Research-paper reproduction often fails because notes, assumptions, formulas, experiment code, logs, and reports are scattered across folders or chat histories. OpenRepro-Agent focuses on making the project loop runnable, inspectable, and auditable before adding more ambitious automation.
The v1.2.0 workflow is:
init → configure-provider → ingest → analyze → plan → list-templates → list-candidates → review-candidates → approve-candidates → scaffold-experiment → set-input → validate-inputs → validate-experiment-spec → run-experiment → rerun-experiment → compare-experiments → run-demo → validate --all → inspect → diagnose → repair-plan → repair --dry-run → run-sweep → compare-runs → lineage → doctor → benchmark → benchmark-suite → benchmark-index → report → handoff → evidence-package → status
- Create a standard paper reproduction project directory.
- Ingest Markdown and text notes into sources/.
- Ingest PDFs, extract text and page-level provenance with pdfplumber, and record extraction status.
- Generate paper_summary.md using rule-based keyword detection and candidate extraction.
- Generate Markdown and JSON model ledgers with formula and parameter candidates marked candidate_unverified.
- Generate EXPERIMENT_PLAN.md and experiment_plan_validation.json.
- Configure mock or OpenAI-compatible providers while keeping real calls disabled by default.
- Scaffold human-gated experiment folders from candidate formulas and parameters.
- Run a lightweight BOC-like signal and autocorrelation demo.
- Run a noise/seed parameter sweep for the built-in demo.
- Save run artifacts: logs, figures, data, metrics, reports, config snapshots, metadata, manifest, mock API usage, and handoff notes.
- Validate run manifests, required artifacts, file sizes, and SHA-256 hashes.
- Validate every project run at once with openrepro validate --all.
- Inspect project state and write workspace/inspect_summary.json.
- Diagnose common workflow failures and suggest repairs.
- Generate advisory repair plans with workspace/repair_plan.json and workspace/REPAIR_PLAN.md.
- Compare two run directories and write run comparison artifacts.
- Run workflow-compliance benchmark tasks from benchmarks/benchmark_schema.json.
- Run benchmark suites from benchmarks/sample_suite.json.
- Generate benchmark_index.json and benchmark_index.md for benchmark runs.
- Use a deterministic mock provider interface with request-hash cache accounting.
- Generate a project-level Markdown report.
- Generate multi-agent handoff files for Claude Code, Codex, GitHub Copilot, or human maintainers.
- Run pytest tests for the minimum workflow.
- Promote human-reviewed formula and parameter candidates into workspace/verified_candidates.json.
- Generate workspace/VERIFIED_CANDIDATES.md for auditable approval notes.
- Let experiment scaffolds detect verified candidates and mark the scaffold as verified_inputs_ready.
- Preview controlled repair actions with openrepro repair --dry-run.
- Generate repair previews in workspace/repair_dry_run.json and workspace/REPAIR_DRY_RUN.md.
- Include manifest regeneration diffs in dry-run previews for manifest mismatch or missing-manifest cases.
- Surface verified candidate counts and approval status in openrepro inspect.
- Surface latest repair dry-run status and action counts in openrepro inspect.
- Include verified candidate and repair dry-run summaries in project reports.
- Include verified candidate and repair dry-run handoff files.
- Suggest approve-candidates from openrepro status when analysis and planning are complete but candidates are not approved.
- Redact likely secrets, tokens, API keys, and email addresses from provider usage previews.
- Store provider cache entries under provider/model/task namespaces.
- Add provider cache policy fields: cache_enabled, cache_ttl_seconds, and redact_prompts.
- Allow configure-provider to update cache and redaction policy without storing secrets.
- Include redacted prompt/response previews and cache namespace metadata in usage records.
- Generate run lineage artifacts with openrepro lineage.
- Record manifest, config, source index, and verified candidate hashes for every run.
- Add benchmark provenance fields for dataset, environment, dependencies, paper source, and runtime notes.
- Surface benchmark provenance completeness in benchmark reports, suites, and indexes.
- Apply explicitly confirmed manifest-only repairs with openrepro repair --apply --only manifest --confirm.
- Write workspace/repair_apply.json and workspace/REPAIR_APPLY.md.
- Keep repair application limited to manifest regeneration from files already present on disk.
- Add openrepro doctor for dependency, project structure, config, and provider readiness checks.
- Surface lineage status in inspect, status, and handoff files.
- Add handoff/RUN_LINEAGE.md to generated handoff bundles.
- Expand smoke tests to cover approval, lineage, repair dry-run/apply, and doctor commands.
- Add openrepro run-experiment for confirmed execution of verified experiment scaffolds.
- Require experiment status verified_inputs_ready and explicit --confirm.
- Capture runner stdout/stderr, execution metadata, report, config snapshot, runner copy, and manifest.
- Keep experiment runs as execution evidence only, not scientific reproduction claims.
- Add openrepro list-candidates to inspect formula and parameter candidates with review status.
- Add openrepro review-candidates with statuses verified_by_human, rejected_by_human, and needs_more_evidence.
- Write workspace/candidate_reviews.json and workspace/CANDIDATE_REVIEWS.md.
- Sync verified_by_human reviews into the existing verified candidate approval artifact.
- Improve openrepro status next-step suggestions for list/review/scaffold/run-experiment flows.
- Add candidate review and experiment-run counts to status and inspect output.
- Expand smoke scripts to cover the v0.7.x command set.
- Add release tags for recent versions.
- Write workspace/paper_metadata.json with title and DOI candidates.
- Add source path, chunk index, page number, context window, and evidence quality to formula candidates.
- Add numeric values, normalized units, context window, and evidence quality to parameter candidates.
- Preserve table/page provenance for PDF table-derived parameter candidates.
- Add --template basic|boc-like|numeric-sweep to openrepro scaffold-experiment.
- Generate template runners for BOC-like traces and numeric sweeps when verified inputs are available.
- Align expected_artifacts.json with the current run-experiment output layout.
- Have run-experiment pass OPENREPRO_RUN_DIR to runners and include template-required artifacts in the run manifest.
- Add openrepro list-templates for supported experiment templates.
- Move template metadata into a shared template registry.
- Surface scaffold template counts and expected-artifact attention counts in inspect and status.
- Diagnose legacy, missing, mismatched, or tampered expected_artifacts.json files for experiment scaffolds.
- Write experiments/<id>/experiment_inputs.json from human-verified formula and parameter candidates.
- Map parameter candidates into parameter_values for template runners.
- Add input completeness checks for template-required inputs such as code_length and noise_std.
- Have template runners read OPENREPRO_EXPERIMENT_INPUTS instead of relying only on hard-coded defaults.
- Snapshot experiment inputs into run outputs and surface the mapping in run reports and handoff files.
- Write configs/environment_snapshot.json for run-experiment.
- Record Python, platform, dependency versions, random seed, runner hash, and repeatability status.
- Add environment snapshots to required run-experiment artifacts.
- Extend lineage with experiment config, input, environment, and runner hashes.
- Add a lightweight same-seed repeatability check against prior experiment runs.
- Add openrepro validate-inputs for experiment input completeness checks.
- Add openrepro set-input for manual input calibration and overrides.
- Track input sources as verified_candidate, manual_override, or default.
- Surface missing required input counts in inspect and status.
- Record input validation details and warnings in experiment run evidence.
- Add openrepro rerun-experiment for repeat execution of an existing verified scaffold.
- Add openrepro compare-experiments for same-experiment metric and hash comparison.
- Write workspace/experiment_comparison.json and workspace/EXPERIMENT_COMPARISON.md.
- Add normalized input hashing so timestamp refreshes do not hide equivalent inputs.
- Extend lineage with experiment repeat groups and repeat run indexes.
- Add openrepro evidence-package for project-level evidence packaging.
- Write reports/evidence_package.json and reports/evidence_package.md.
- Summarize project status, inspect output, workspace artifacts, experiments, runs, lineage, benchmark indexes, reports, and handoff completeness.
- Keep evidence-package policy explicit: workflow evidence is not a scientific reproduction claim.
- Add v1.0.0 regression tests and smoke coverage for evidence package generation.
- Add evidence package source fingerprints and artifact SHA-256 hashes.
- Add freshness detection so status can report missing, current, or stale evidence packages.
- Add openrepro evidence-package --zip for compact handoff export.
- Add handoff/EVIDENCE_PACKAGE.md.
- Add stale-package and zip-export regression tests.
- Add section-aware evidence provenance for formula and parameter candidates.
- Add workspace/caption_index.json and workspace/CAPTION_INDEX.md.
- Add candidate risk flags and high-risk candidate counts.
- Surface candidate risk levels in openrepro inspect.
- Include caption evidence in evidence packages.
- Add experiments/<id>/experiment_spec.json as an execution contract.
- Add openrepro validate-experiment-spec.
- Validate experiment specs before run-experiment.
- Snapshot specs into run outputs and required run manifests.
- Compare experiment spec hashes across runs and record them in lineage.
- It does not fully read or understand papers.
- It does not verify mathematical formulas automatically.
- It does not generate full simulation code for arbitrary papers.
- It does not automatically repair failed experiments.
- It does not apply repair previews automatically; dry-run output is for review.
- It does not enable real LLM providers by default; OpenAI-compatible calls require explicit opt-in and environment-backed secrets.
- It does not claim benchmark scores, user counts, token usage, or efficiency improvements.
- The BOC demo is a lightweight BOC-like demo, not a complete BOC acquisition/tracking implementation and not a full reproduction of any paper.
git clone <your-fork-url> OpenRepro-Agent
cd OpenRepro-Agent
python -m venv .venv
# Windows PowerShell
.venv\Scripts\Activate.ps1
# macOS / Linux
source .venv/bin/activate
pip install -e ".[dev]"
Python 3.10+ is required.
Run the unit test suite:
python -m pytest -q
On Windows, if pytest cannot access its default temp directory, use a workspace-local base temp directory:
New-Item -ItemType Directory -Force .codex_tmp\pytest-basetemp | Out-Null
python -m pytest -q --basetemp .codex_tmp\pytest-basetemp
openrepro init boc_demo
openrepro configure-provider boc_demo --provider mock --disable-real-api
openrepro ingest boc_demo --source examples/boc_notes.md
openrepro analyze boc_demo
openrepro plan boc_demo
openrepro list-templates
openrepro list-candidates boc_demo
openrepro review-candidates boc_demo --candidate-id F001 --status needs_more_evidence --reviewer human
openrepro approve-candidates boc_demo --all --reviewer human
openrepro scaffold-experiment boc_demo --experiment-id boc_candidate_exp --template boc-like
openrepro set-input boc_demo --experiment-id boc_candidate_exp --name noise_std --value 0.05
openrepro set-input boc_demo --experiment-id boc_candidate_exp --name code_length --value 128
openrepro validate-inputs boc_demo --experiment-id boc_candidate_exp
openrepro validate-experiment-spec boc_demo --experiment-id boc_candidate_exp
openrepro run-experiment boc_demo --experiment-id boc_candidate_exp --confirm
openrepro rerun-experiment boc_demo --experiment-id boc_candidate_exp --confirm
openrepro compare-experiments boc_demo --experiment-id boc_candidate_exp
openrepro run-demo boc_demo
openrepro validate boc_demo
openrepro validate boc_demo --all
openrepro inspect boc_demo
openrepro diagnose boc_demo
openrepro repair-plan boc_demo
openrepro repair boc_demo --dry-run
openrepro run-sweep boc_demo --noise-std 0.0 --noise-std 0.1 --seed 42
openrepro validate boc_demo
openrepro compare-runs boc_demo
openrepro lineage boc_demo
openrepro doctor boc_demo
openrepro benchmark --task benchmarks/sample_task.json --project boc_benchmark
openrepro benchmark-suite --suite benchmarks/sample_suite.json --project-prefix boc_suite
openrepro benchmark-index
openrepro report boc_demo
openrepro handoff boc_demo
openrepro evidence-package boc_demo --zip
openrepro status boc_demo
PDF ingestion is also supported:
openrepro ingest boc_demo --source path/to/paper.pdf
PDF text is extracted to:
workspace/extracted_sources/<paper>.txt
workspace/extracted_sources/<paper>.pages.json
Creates:
boc_demo/
  project_config.yaml
  sources/
  workspace/
  outputs/
  handoff/
  reports/
  logs/
It also creates initial handoff files:
handoff/PROJECT_CONTEXT.md
handoff/AGENT_HANDOFF.md
handoff/NEXT_STEPS.md
Copies Markdown/txt/PDF sources into sources/ and updates:
workspace/source_index.json
For PDFs, v0.4.0 records:
- extraction_status
- extracted_text_path
- pages_path
- page_count
- char_count
- table_count
Extraction failures do not remove the copied source. They are recorded as extraction_failed so the rest of the workflow can continue.
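The failure-tolerant behavior described above can be sketched roughly as follows. This is an illustration only, not the project's implementation: `record_extraction`, the `extract` callable, the `error` field, and the `extracted` status value are hypothetical names, while `extraction_failed`, `page_count`, and `char_count` come from the text above.

```python
from pathlib import Path
from typing import Callable


def record_extraction(source: Path, extract: Callable[[Path], list[str]]) -> dict:
    """Build a source-index entry; extraction errors are recorded, never raised."""
    entry = {"source": str(source), "extraction_status": "pending"}
    try:
        # Hypothetical extractor: returns one text string per page.
        pages = extract(source)
    except Exception as exc:  # broad on purpose: any failure becomes a status
        entry["extraction_status"] = "extraction_failed"
        entry["error"] = f"{type(exc).__name__}: {exc}"
        return entry
    entry["extraction_status"] = "extracted"  # assumed success label
    entry["page_count"] = len(pages)
    entry["char_count"] = sum(len(page) for page in pages)
    return entry
```

Because the entry is returned in both cases, later stages can keep going and simply flag the source for review.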
Generates:
workspace/paper_summary.md
workspace/MODEL_LEDGER.md
workspace/analysis_result.json
workspace/formula_candidates.json
workspace/parameter_candidates.json
workspace/model_ledger.json
workspace/paper_metadata.json
workspace/caption_index.json
workspace/CAPTION_INDEX.md
The analyzer is rule-based. Formula, parameter, and model records are candidates and require human verification. Formula and parameter candidates include section labels, context windows, evidence quality, and risk flags such as missing page anchors or thin context.
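To make "rule-based candidate" concrete, here is a minimal sketch of a regex pass that emits parameter candidates with a context window. The pattern, the function name, and the record keys other than `candidate_unverified` are assumptions for illustration; the real analyzer's rules and schema may differ.

```python
import re

# Hypothetical rule: "name = number", e.g. "noise_std = 0.05".
PARAM_RE = re.compile(r"\b(?P<name>[A-Za-z_]\w*)\s*=\s*(?P<value>-?\d+(?:\.\d+)?)")


def extract_parameter_candidates(text: str, window: int = 30) -> list[dict]:
    """Return unverified parameter candidates found by a simple regex rule."""
    candidates = []
    for index, match in enumerate(PARAM_RE.finditer(text), start=1):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        candidates.append({
            "id": f"P{index:03d}",
            "name": match["name"],
            "value": float(match["value"]),
            "context_window": text[start:end],
            # Every record starts unverified; a human must review it.
            "status": "candidate_unverified",
        })
    return candidates
```

A rule this simple will produce false positives, which is why every record stays `candidate_unverified` until a human reviews it.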
Generates:
workspace/EXPERIMENT_PLAN.md
workspace/experiment_plan_validation.json
The validation file checks whether sources exist, whether PDF extraction needs review, whether candidate formulas and parameters were detected, and whether demo configuration values are valid.
Updates provider settings in project_config.yaml without storing secrets. Mock mode remains the default:
openrepro configure-provider boc_demo --provider mock --disable-real-api
OpenAI-compatible calls require explicit opt-in and an environment variable:
openrepro configure-provider boc_demo --provider openai --model gpt-4.1-mini --enable-real-api --api-key-env OPENAI_API_KEY
The command reports whether the configured provider is ready for real calls; it never prints or stores API key values.
Provider cache and redaction policy can also be configured:
openrepro configure-provider boc_demo --provider mock --cache-enabled --cache-ttl-seconds 86400 --redact-prompts
Creates a human-gated experiment scaffold under:
experiments/<experiment_id>/
  README.md
  APPROVAL_REQUIRED.md
  experiment_config.json
  experiment_spec.json
  expected_artifacts.json
  experiment_inputs.json
  runner.py
  runner_stub.py
The scaffold is generated from candidate formulas and parameters and is marked approval_required unless verified candidates exist or --acknowledge-candidates is provided. It is a coding starting point, not a reproduction claim.
Use --template basic, --template boc-like, or --template numeric-sweep to choose the starter runner. Template runners are created only for runnable scaffolds, and they write declared outputs under OPENREPRO_RUN_DIR during run-experiment.
Verified candidates are also mapped into experiment_inputs.json. The file
contains formula evidence, parameter records, parameter_values, and input
completeness status. Template runners read it through
OPENREPRO_EXPERIMENT_INPUTS.
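A runner that follows this contract might look like the sketch below. It assumes that OPENREPRO_EXPERIMENT_INPUTS holds the path to experiment_inputs.json and that OPENREPRO_RUN_DIR holds the run directory; the function names and the `data/metrics.json` output path are hypothetical.

```python
import json
import os
from pathlib import Path


def load_inputs(defaults: dict) -> dict:
    """Overlay parameter_values from the inputs file on hard-coded defaults."""
    values = dict(defaults)
    inputs_path = os.environ.get("OPENREPRO_EXPERIMENT_INPUTS")  # assumed: file path
    if inputs_path and Path(inputs_path).is_file():
        payload = json.loads(Path(inputs_path).read_text(encoding="utf-8"))
        values.update(payload.get("parameter_values", {}))
    return values


def write_metrics(metrics: dict) -> Path:
    """Write metrics under the run directory supplied by run-experiment."""
    run_dir = Path(os.environ.get("OPENREPRO_RUN_DIR", "."))
    out = run_dir / "data" / "metrics.json"  # hypothetical output location
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(metrics, indent=2), encoding="utf-8")
    return out
```

Falling back to defaults when the variable is unset keeps the runner usable outside the CLI, at the cost of runs that silently use default values.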
Use openrepro set-input to add or override individual values. Manual values
are marked as manual_override and validation artifacts are written under
workspace/experiment_input_validation.json and
workspace/EXPERIMENT_INPUT_VALIDATION.md.
Checks whether the scaffold has all inputs required by its template. The command exits non-zero when required inputs are missing.
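The completeness check can be pictured with the small sketch below. The requirement table is a guess (the text only names code_length and noise_std as examples of template-required inputs), and both function names are hypothetical.

```python
# Hypothetical requirement table; the real template registry may differ.
REQUIRED_INPUTS = {"boc-like": ["code_length", "noise_std"]}


def missing_required_inputs(template: str, parameter_values: dict) -> list[str]:
    """List required input names that have no value yet."""
    required = REQUIRED_INPUTS.get(template, [])
    return [name for name in required if parameter_values.get(name) is None]


def validate_inputs_exit_code(template: str, parameter_values: dict) -> int:
    """Mirror the CLI contract: non-zero when required inputs are missing."""
    return 1 if missing_required_inputs(template, parameter_values) else 0
```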
Validates the experiment contract and writes:
workspace/experiment_spec_validation.json
workspace/EXPERIMENT_SPEC_VALIDATION.md
The contract records template, input, runner, artifact, and metric expectations. Validation checks engineering consistency only; it does not verify scientific correctness.
Sets an input value in experiment_inputs.json and refreshes input validation.
Values may be JSON scalars or comma-separated lists.
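One plausible reading of "JSON scalars or comma-separated lists" is sketched below. The parsing order and the fallback to a plain string are assumptions, and the function names are hypothetical.

```python
import json


def parse_scalar(text: str):
    """Parse a JSON scalar; fall back to the stripped string."""
    stripped = text.strip()
    try:
        value = json.loads(stripped)
    except json.JSONDecodeError:
        return stripped
    # Objects and arrays are not scalars, so keep them as raw text.
    return stripped if isinstance(value, (dict, list)) else value


def parse_input_value(raw: str):
    """Treat a value containing commas as a list of scalars."""
    if "," in raw:
        return [parse_scalar(part) for part in raw.split(",")]
    return parse_scalar(raw)
```

Under this reading, `--value 0.05` yields a float and `--value 0.0,0.1,0.2` yields a list of floats.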
Lists supported experiment scaffold templates, their purpose, required
artifacts, and input hints. The current templates are basic, boc-like, and
numeric-sweep.
Writes human approval artifacts for selected candidate formulas and parameters:
workspace/verified_candidates.json
workspace/VERIFIED_CANDIDATES.md
Approve all currently detected candidates:
openrepro approve-candidates boc_demo --all --reviewer human --note "Checked against paper notes."
Or approve specific candidate IDs:
openrepro approve-candidates boc_demo --formula-id F001 --parameter-id P001
Verified candidates are implementation inputs only. They do not prove that the paper has been reproduced.
Lists formula and parameter candidates with their latest review status. By
default, unreviewed records keep candidate_unverified.
Records a human review status for one or more candidates:
openrepro review-candidates boc_demo --candidate-id F001 --status needs_more_evidence --reviewer human --note "Need page-level context."
openrepro review-candidates boc_demo --candidate-id P001 --status rejected_by_human --reviewer human
openrepro review-candidates boc_demo --candidate-id F002 --status verified_by_human --reviewer human
Review artifacts are written to:
workspace/candidate_reviews.json
workspace/CANDIDATE_REVIEWS.md
When a review uses verified_by_human, OpenRepro-Agent also updates
workspace/verified_candidates.json for compatibility with experiment
scaffolding and run-experiment.
Runs a verified experiment scaffold and records execution evidence in a timestamped output directory. The command refuses to run unless:
- --confirm is provided.
- experiments/<id>/experiment_config.json has status: verified_inputs_ready.
- the experiment config marks the scaffold as runnable.
Outputs include:
logs/run.log
data/execution_result.json
reports/experiment_report.md
configs/experiment_config_snapshot.json
configs/experiment_inputs_snapshot.json
configs/experiment_spec_snapshot.json
configs/environment_snapshot.json
code/runner.py
metadata.json
manifest.json
The command executes the scaffold runner and records evidence only. It does not claim paper reproduction success.
For v0.8.1 scaffolds, run-experiment reads
experiments/<id>/expected_artifacts.json and includes those required paths in
the run manifest, so template-specific outputs are validated with the rest of
the run evidence. If a runner exits successfully but omits required template
artifacts, the command fails with an artifact-validation error.
Runs the same verified experiment scaffold again and records another normal
run-experiment output directory. It uses the same guardrails as
run-experiment: the scaffold must be runnable, inputs are refreshed, and the
command needs explicit --confirm.
openrepro compare-experiments <project_name> --experiment-id ID [--left-run PATH] [--right-run PATH]
Compares two run-experiment outputs for the same experiment, defaulting to
the latest two runs for that experiment id, and writes:
workspace/experiment_comparison.json
workspace/EXPERIMENT_COMPARISON.md
The comparison reports metric equality, metric deltas, runner hashes, raw input hashes, normalized input hashes, spec hashes, and environment hashes. It is repeatability evidence only; it does not claim paper reproduction success.
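The difference between raw and normalized input hashes can be illustrated as follows. This sketch assumes that normalization means dropping timestamp-like keys before hashing canonical JSON; the key names in `VOLATILE_KEYS` and all function names are hypothetical.

```python
import hashlib
import json

VOLATILE_KEYS = {"created_at", "updated_at"}  # hypothetical timestamp fields


def strip_volatile(obj):
    """Recursively drop keys that change on every refresh."""
    if isinstance(obj, dict):
        return {k: strip_volatile(v) for k, v in obj.items() if k not in VOLATILE_KEYS}
    if isinstance(obj, list):
        return [strip_volatile(item) for item in obj]
    return obj


def normalized_hash(inputs: dict) -> str:
    """SHA-256 over canonical JSON with volatile keys removed."""
    canonical = json.dumps(strip_volatile(inputs), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def metric_deltas(left: dict, right: dict) -> dict:
    """Right-minus-left difference for metrics present in both runs."""
    return {key: right[key] - left[key] for key in sorted(set(left) & set(right))}
```

Two input files that differ only in a refresh timestamp then share a normalized hash, while their raw hashes would differ.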
Creates a timestamped output directory, for example:
outputs/2026-xx-xx_20-30-15_boc_demo/
  logs/run.log
  figures/correlation.png
  data/demo_signal.npy
  data/correlation.npy
  data/demo_metrics.json
  reports/demo_report.md
  configs/project_config_snapshot.yaml
  code/README.md
  api_usage/api_usage.jsonl
  api_usage/api_usage_summary.json
  handoff/AGENT_HANDOFF.md
  metadata.json
  manifest.json
The demo generates a pseudo-random spreading code, a square-wave subcarrier, a lightweight BOC-like signal, a noisy observation, and a normalized autocorrelation function.
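The signal chain described above can be sketched with the standard library alone. This is not the project's demo code: the parameter names, the one-subcarrier-period-per-chip layout, and the choice of circular autocorrelation are all assumptions made for illustration.

```python
import random


def boc_like_signal(code_length: int = 32, samples_per_chip: int = 4, seed: int = 42) -> list[int]:
    """Spread a pseudo-random +/-1 code with a square-wave subcarrier."""
    rng = random.Random(seed)
    code = [rng.choice((-1, 1)) for _ in range(code_length)]
    half = samples_per_chip // 2
    # Assumed subcarrier: +1 for the first half of each chip, -1 for the rest.
    subcarrier = [1] * half + [-1] * (samples_per_chip - half)
    return [chip * level for chip in code for level in subcarrier]


def normalized_autocorrelation(x: list[float]) -> list[float]:
    """Circular autocorrelation divided by signal energy (value at lag 0 is 1)."""
    n = len(x)
    energy = sum(v * v for v in x)
    return [sum(x[i] * x[(i + lag) % n] for i in range(n)) / energy for lag in range(n)]


signal = boc_like_signal()
noise_rng = random.Random(0)
noisy = [v + noise_rng.gauss(0.0, 0.05) for v in signal]  # noisy observation
acf = normalized_autocorrelation(noisy)
```

Seeding both generators keeps the sketch deterministic, which is the property a same-seed repeatability check relies on.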
Validates the latest run directory by default, or a specific run directory when --run-dir is provided.
It checks:
- manifest.json exists and is readable
- required artifacts exist
- manifest entries match current file size
- manifest entries match current SHA-256 hashes
The command exits with code 0 when valid and code 1 when validation fails.
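The checks listed above amount to comparing each manifest entry against the file on disk. A minimal sketch, assuming each entry carries hypothetical `path`, `size`, and `sha256` keys (the real manifest schema may name them differently):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_entries(run_dir: Path, entries: list[dict]) -> list[str]:
    """Return one problem string per entry that fails a check."""
    problems = []
    for entry in entries:
        target = run_dir / entry["path"]
        if not target.is_file():
            problems.append(f"missing: {entry['path']}")
        elif target.stat().st_size != entry["size"]:
            problems.append(f"size mismatch: {entry['path']}")
        elif sha256_of(target) != entry["sha256"]:
            problems.append(f"hash mismatch: {entry['path']}")
    return problems


def exit_code(problems: list[str]) -> int:
    """0 when valid, 1 when any check failed."""
    return 1 if problems else 0
```

Checking size before hashing is a cheap early exit: a changed size already proves a mismatch without reading the file.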
Use --all to validate every run under outputs/ in one pass:
openrepro validate boc_demo --all
The all-runs mode prints a table and includes diagnosis suggestions for any failed run.
Prints a compact project health table covering sources, PDF extraction status, formula and parameter candidate counts, run counts, the latest manifest status, benchmark run count, diagnosis health, and the next suggested command.
It also writes:
workspace/inspect_summary.json
The JSON summary is intended for agents and automation that need the same state snapshot without parsing terminal output.
Classifies project or run failures and suggests repairs. It covers missing artifacts, manifest mismatches, missing source files, PDF extraction failures, invalid demo config, provider-disabled errors, and unknown runtime errors.
Writes advisory repair artifacts:
workspace/repair_plan.json
workspace/REPAIR_PLAN.md
v0.4.0 repair plans do not edit files automatically. They convert diagnosis output into ordered manual repair suggestions.
Writes a controlled repair preview without mutating project or run files:
workspace/repair_dry_run.json
workspace/REPAIR_DRY_RUN.md
For manifest mismatch and missing-manifest issues, the dry-run preview includes a unified diff showing how manifest.json would change if regenerated from files currently present on disk. Missing scientific artifacts are never fabricated.
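A unified diff of this kind can be produced with Python's difflib. The sketch below is an assumption about how such a preview could be built, not the project's code; `manifest_diff` and the file labels are hypothetical.

```python
import difflib
import json


def manifest_diff(current: dict, regenerated: dict) -> str:
    """Unified diff between the stored manifest and a regenerated one."""
    before = (json.dumps(current, indent=2, sort_keys=True) + "\n").splitlines(keepends=True)
    after = (json.dumps(regenerated, indent=2, sort_keys=True) + "\n").splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            before,
            after,
            fromfile="manifest.json (current)",
            tofile="manifest.json (regenerated)",
        )
    )
```

Sorting keys before diffing keeps the preview focused on real changes rather than key order. An empty string means regeneration would change nothing.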
To apply the manifest-only repair after reviewing the dry-run:
openrepro repair boc_demo --apply --only manifest --confirm
v0.6.1 apply mode does not generate missing scientific artifacts, edit experiment code, change configuration values, or infer parameters.
Runs the built-in BOC-like demo across a noise/seed grid. Defaults:
noise_std = [0.0, 0.05, 0.1, 0.2]
seed = project_config.yaml demo.seed
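The noise/seed grid can be expanded as a Cartesian product. In this sketch, `sweep_grid` is a hypothetical helper and `config_seed` stands in for the demo.seed value from project_config.yaml; its default of 42 is only a placeholder.

```python
from itertools import product

DEFAULT_NOISE_STD = [0.0, 0.05, 0.1, 0.2]


def sweep_grid(noise_std=None, seeds=None, config_seed: int = 42) -> list[dict]:
    """Expand noise levels and seeds into one config per grid point."""
    noise_values = list(noise_std) if noise_std is not None else DEFAULT_NOISE_STD
    seed_values = list(seeds) if seeds is not None else [config_seed]
    return [{"noise_std": n, "seed": s} for n, s in product(noise_values, seed_values)]
```

With the defaults this yields four grid points, one per noise level, all sharing the configured seed.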
Outputs include:
data/sweep_results.json
data/sweep_metrics.csv
figures/sweep_correlation_peak.png
reports/sweep_report.md
metadata.json
manifest.json
Compares two run directories, defaulting to the latest two runs, and writes:
workspace/run_comparison.json
workspace/RUN_COMPARISON.md
The comparison reports observed manifest status and metric differences only.
Writes project run lineage artifacts:
workspace/run_lineage.json
workspace/RUN_LINEAGE.md
Each run entry records the parent command and SHA-256 hashes for the run manifest, config snapshot, project source index, and verified candidates when available. Experiment runs also include repeat group ids and repeat indexes so same-experiment reruns can be audited. The lineage report is provenance evidence only; it does not claim scientific reproduction success.
Checks local dependencies, project structure, project config, and provider readiness. It writes:
workspace/doctor.json
workspace/DOCTOR.md
Doctor checks workflow readiness only; it does not claim scientific reproduction success.
Runs a workflow-compliance benchmark task. If --project is omitted, the task id becomes the project name. If the project does not exist, it is initialized automatically.
Outputs are written under:
benchmarks/runs/<timestamp>_<task_id>/
  benchmark_result.json
  benchmark_report.md
  api_usage/api_usage.jsonl
  api_usage/api_usage_summary.json
  manifest.json
Benchmark results only report observed workflow evidence: source ingestion, generated artifacts, manifest validity, and metric availability. They do not claim paper reproduction success or scientific benchmark scores.
Benchmark task files may use either the v0.3.0 fields (expected_artifacts, evaluation_metrics) or the v0.4.0 fields:
{
  "artifacts": {
    "required": ["workspace/paper_summary.md"],
    "optional": ["outputs/<timestamp>_<project>/reports/demo_report.md"]
  },
  "metrics": {
    "required": ["signal_length"],
    "optional": ["side_lobe_level"]
  },
  "workflow": {
    "run_demo": true,
    "run_sweep": false
  },
  "pass_criteria": {
    "require_manifest_valid": true
  }
}
Optional artifacts and metrics are reported but do not make the benchmark status fail.
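The required/optional distinction can be sketched as a small evaluator over the v0.4.0 task fields. The function name, the result keys, and the `passed`/`failed` status strings are hypothetical.

```python
def evaluate_task(task: dict, artifacts_found: set, metrics_found: set, manifest_valid: bool) -> dict:
    """Fail only on missing required items or a required-but-invalid manifest."""
    artifacts = task.get("artifacts", {})
    metrics = task.get("metrics", {})
    missing_required = sorted(set(artifacts.get("required", [])) - artifacts_found)
    missing_required += sorted(set(metrics.get("required", [])) - metrics_found)
    # Optional items are reported but never affect the status.
    missing_optional = sorted(set(artifacts.get("optional", [])) - artifacts_found)
    missing_optional += sorted(set(metrics.get("optional", [])) - metrics_found)
    needs_manifest = task.get("pass_criteria", {}).get("require_manifest_valid", False)
    manifest_ok = manifest_valid or not needs_manifest
    return {
        "status": "passed" if not missing_required and manifest_ok else "failed",
        "missing_required": missing_required,
        "missing_optional": missing_optional,
    }
```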
Runs a collection of benchmark tasks and writes suite-level evidence under:
benchmarks/runs/<timestamp>_<suite_id>_suite/
  benchmark_suite_result.json
  benchmark_suite_report.md
  manifest.json
Suite output is still workflow-compliance evidence only; it does not aggregate scientific reproduction scores.
Rebuilds benchmark indexes for existing benchmark runs:
benchmarks/runs/benchmark_index.json
benchmarks/runs/benchmark_index.md
The index includes task id, status, creation time, benchmark directory, project directory, latest run directory, artifact pass count, metric pass count, manifest validity, and diagnosis count. It is regenerated automatically after every benchmark run.
Generates:
reports/report.md
Generates or updates:
handoff/PROJECT_CONTEXT.md
handoff/PAPER_SUMMARY.md
handoff/MODEL_LEDGER.md
handoff/EXPERIMENT_PLAN.md
handoff/CODE_STATUS.md
handoff/RUN_LOG_SUMMARY.md
handoff/ERROR_NOTES.md
handoff/NEXT_STEPS.md
handoff/AGENT_HANDOFF.md
Generates a project-level evidence bundle:
reports/evidence_package.json
reports/evidence_package.md
reports/evidence_package.zip
The package summarizes inspect output, workspace artifacts, experiment
scaffolds, run manifests, lineage, benchmark indexes, project reports, and
handoff completeness. It includes source fingerprints and artifact hashes so
openrepro status can report whether the package is current or stale. It is
intended as an auditable handover artifact, not as a scientific reproduction
claim.
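Freshness detection of this kind usually means comparing a stored fingerprint with one recomputed from current artifact hashes. The sketch below assumes that approach; the `source_fingerprint` key and both function names are hypothetical, while the three status values come from the feature list.

```python
import hashlib
from typing import Optional


def source_fingerprint(artifact_hashes: dict[str, str]) -> str:
    """Combine per-artifact hashes into one order-independent fingerprint."""
    digest = hashlib.sha256()
    for path in sorted(artifact_hashes):
        digest.update(f"{path}\0{artifact_hashes[path]}\n".encode("utf-8"))
    return digest.hexdigest()


def package_freshness(package: Optional[dict], current_hashes: dict[str, str]) -> str:
    """Report whether a stored package still matches the project state."""
    if package is None:
        return "missing"
    if package.get("source_fingerprint") == source_fingerprint(current_hashes):
        return "current"
    return "stale"
```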
Prints whether the project exists, whether each workflow stage has completed, the most recent run directory, report status, handoff completeness, and the next suggested command.
Each demo, sweep, or benchmark run writes:
manifest.json
The manifest records:
- schema version
- OpenRepro-Agent version
- run id
- command type
- created timestamp
- required artifacts
- relative artifact paths
- artifact category
- existence flag
- file size
- SHA-256 digest
This lets later agents, humans, and CI checks verify that reports and metrics are backed by actual files.
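A manifest with the fields listed above could be assembled roughly as follows. The key names, the category rule (first path segment), and the `schema_version` placeholder are assumptions, not the project's schema.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(run_dir: Path, required: list[str], command_type: str) -> dict:
    """Record existence, size, and SHA-256 for each required artifact."""
    artifacts = []
    for rel in required:
        path = run_dir / rel
        exists = path.is_file()
        data = path.read_bytes() if exists else b""
        artifacts.append({
            "path": rel,  # relative to the run directory
            "category": rel.split("/", 1)[0] if "/" in rel else "root",
            "exists": exists,
            "size": len(data) if exists else None,
            "sha256": hashlib.sha256(data).hexdigest() if exists else None,
        })
    return {
        "schema_version": "example-1",  # placeholder value
        "run_id": run_dir.name,
        "command_type": command_type,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "required_artifacts": list(required),
        "artifacts": artifacts,
    }
```

Missing files are recorded with `exists: False` rather than skipped, so a later validation pass can report them instead of silently passing.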
v0.4.0 keeps MockProvider as the default and adds an explicit opt-in OpenAI-compatible provider path. Real calls require:
- api.enable_real_api: true
- a non-mock provider such as openai
- an environment variable such as OPENAI_API_KEY
Provider usage is recorded in:
api_usage/api_usage.jsonl
api_usage/api_usage_summary.json
Mock and cached events use zero prompt tokens, zero completion tokens, zero total tokens, and zero estimated cost. The summary keeps total_calls at zero for mocked and cached events so the project does not invent real API usage. Real provider events are counted only from provider-returned usage data, and costs stay zero unless an auditable provider estimate is available.
Tracked fields include:
- provider and model
- task type
- prompt/completion/total tokens
- estimated cost
- cache hit status
- call status
- request hash
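The accounting rule for mocked and cached events can be sketched as below. `SketchMockProvider`, the event key names, and the `mocked`/`cached`/`real` status labels are hypothetical stand-ins, not the project's provider interface.

```python
import hashlib
import json


def request_hash(provider: str, model: str, task: str, prompt: str) -> str:
    """Deterministic key for identical requests."""
    payload = json.dumps(
        {"provider": provider, "model": model, "task": task, "prompt": prompt},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SketchMockProvider:
    """Deterministic mock that records zero-token usage events."""

    def __init__(self, model: str = "mock-model"):
        self.model = model
        self.cache: dict[str, str] = {}
        self.events: list[dict] = []

    def complete(self, task: str, prompt: str) -> str:
        key = request_hash("mock", self.model, task, prompt)
        cache_hit = key in self.cache
        if not cache_hit:
            self.cache[key] = f"[mock response for {task}]"
        self.events.append({
            "provider": "mock", "model": self.model, "task_type": task,
            "prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0,
            "estimated_cost": 0.0, "cache_hit": cache_hit,
            "call_status": "cached" if cache_hit else "mocked",
            "request_hash": key,
        })
        return self.cache[key]


def summarize(events: list[dict]) -> dict:
    """Count only real provider events toward calls, tokens, and cost."""
    real = [e for e in events if e["call_status"] == "real"]
    return {
        "total_calls": len(real),
        "total_tokens": sum(e["total_tokens"] for e in real),
        "estimated_cost": sum(e["estimated_cost"] for e in real),
        "cache_hits": sum(1 for e in events if e["cache_hit"]),
    }
```

Counting only real events in the summary is what keeps mocked runs from appearing as API usage.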
The handoff/ directory is designed for both humans and coding agents. It separates project context, paper summary, model ledger, experiment plan, code status, run logs, error notes, next steps, and the final handoff memo.
Important rule: handoff files must distinguish between confirmed facts, assumptions, placeholders, candidates, and future work.
The benchmarks/ directory contains a task schema, a sample task, a sample suite, generated benchmark run outputs, and a rebuildable benchmark index. Benchmarks report workflow-compliance evidence and provenance only. They do not report scientific benchmark scores or claim a paper has been reproduced.
- v0.1.0: runnable CLI workflow and lightweight BOC-like demo.
- v0.2.0: PDF text extraction, artifact manifests, formula/parameter candidates, experiment plan validation, demo parameter sweeps.
- v0.3.0: provider interface, cache-aware API usage, benchmark runner, failure diagnosis and repair suggestions.
- v0.3.1: project inspection, validate --all, benchmark indexing, and compatible benchmark schema hardening.
- v0.4.0: opt-in OpenAI-compatible provider path, human-gated experiment scaffolds, benchmark suites, repair plans, and run comparison.
- v0.5.0: candidate approval artifacts, verified-input scaffolds, and controlled repair dry-run previews.
- v0.5.1: verified candidate and repair dry-run status surfaced through inspect, report, handoff, and status.
- v0.5.2: provider prompt/response preview redaction, cache namespace, and cache policy controls.
- v0.6.0: benchmark provenance fields and project run lineage artifacts.
- v0.6.1: confirmed manifest-only repair apply.
- v0.6.2: doctor checks, lineage status visibility, and expanded smoke coverage.
- v0.7.0: confirmed execution of verified experiment scaffolds.
- v0.7.1: candidate listing and review lifecycle.
- v0.7.2: status/inspect stabilization, smoke coverage, and release tag cleanup.
- v0.8.0: paper metadata, DOI candidates, and candidate evidence provenance.
- v0.8.1: experiment templates and template-specific run artifact validation.
- v0.8.2: template registry, list-templates, and scaffold expected-artifact diagnostics.
- v0.9.0: verified candidate to experiment input mapping, input snapshots, and input-aware template runners.
- v0.9.1: environment snapshots, runner/input/environment lineage hashes, and same-seed repeatability checks.
- v0.9.2: input validation, manual input overrides, and missing-input visibility.
- v0.9.3: repeat experiment execution, same-experiment comparison artifacts, and lineage repeat indexes.
- v1.0.0: project-level evidence packages for auditable handover.
- v1.0.1: evidence package freshness, artifact hashes, zip export, and handoff integration.
- v1.1.0: section-aware candidate provenance, caption indexes, and high-risk candidate visibility.
- v1.2.0: experiment specs, spec validation, run spec snapshots, and spec hashes.
See ROADMAP.md for details.
OpenRepro-Agent v1.2.0 is an engineering scaffold for reproducibility workflows. It should not be used to claim that a paper has been reproduced unless the user has independently verified formulas, parameters, code, data, and outputs.
This project must not fabricate:
- benchmark results
- user counts
- token usage
- cost estimates
- accuracy or efficiency improvements
- claims that a lightweight demo is a complete paper reproduction
Only actual generated artifacts should be reported.