GapForge v1.0 is a stable, evidence-gated research ideation OS. It turns a broad topic into auditable research state: papers, notes, claims, evidence, gaps, novelty dossiers, experiment plans, benchmark records, manuscript packages, reviewer objections, release gates, and reports.
GapForge v2 is the Idea Discovery Engine line. v2 searches across topic portfolios, generates and mutates idea candidates, expands constructive gaps and cross-domain transfers, runs novelty/counterevidence loops and idea tournaments, captures human preference feedback, and measures idea yield. The v2 principle is simple: GapForge should try much harder to find a good idea, but it must not force a bad one.
GapForge v1.0 supports:
- literature campaigns
- Codex/GPT-5.4 assisted research workflows
- novelty and prior-work gates
- experiment execution
- benchmark and replication packages
- manuscript and artifact evaluation packages
- correct refusal outcomes
- release gates and compatibility audits
GapForge v2 planning docs:
- v2 roadmap
- v2 acceptance criteria
- v2 idea discovery
- v2 idea yield metrics
- v2 research agenda mode
- v2 human feedback
- v2.1 selected idea execution roadmap
- v2.1 selected idea
- v2.1 benchmark specification
- v2.1 acceptance criteria
- v2.2 pilot benchmark roadmap
- v2.2 pilot benchmark specification
- v2.2 acceptance criteria
- v2.2 power and sample-size plan
- v2.2 reviewer blockers
- v2.3 main-scale benchmark roadmap
- v2.3 main benchmark specification
- v2.3 related-work completion plan
- v2.3 publication-readiness gate
- v2.3 acceptance criteria
GapForge is intentionally skeptical. It can recommend a direction only when evidence gates support one, and it can refuse when literature coverage, novelty evidence, or empirical support is insufficient. It is not an exhaustive autonomous literature reviewer, and it must not fabricate citations, experimental results, venue acceptance, publication readiness, or novelty claims. Deterministic and offline-safe paths remain the default.
- v0.1: deterministic research OS foundation with run state, source connectors, skill orchestration, claim ledger, novelty gate, evals, and reports.
- v0.2: full-text evidence and novelty dossier upgrade with PDF artifacts, sections, evidence spans, source coverage, citation graph, gap evidence matrices, human review, and strict reports.
- v0.3: semantic plus LLM-assisted plus multi-run project-memory upgrade with hybrid retrieval, source policy profiles, active loop decisions, related-work matrices, direction maturation, protocols, review queues, dashboard, and paper packages.
- v0.4: actual Codex/GPT-5.4 agentic campaign release path with campaign-level state, task packs, direct/handoff runner support, strict validated import, repair/rollback, campaign dashboards/reports, v4 evals, and release gates. Fake-agent success still does not count as real-run acceptance.
- v0.5: real literature campaign quality layer. v0.5 validates multi-step live-literature campaigns, source coverage quality, closest-prior-work recall, real-paper citation grounding, expert review, experiment protocol quality, and correct rejection behavior when novelty is weak.
- v0.6: experiment execution and empirical validation layer. v0.6 distinguishes protocols, scaffolds, smoke runs, pilot/main runs, failed/negative experiments, statistical analyses, reproducibility checks, and artifact-backed empirical claims.
- v0.7: real benchmark execution and replication layer. v0.7 distinguishes fixture smoke, local benchmark, full benchmark, failed jobs, underpowered runs, benchmark comparison, replication packages, and real/local public benchmark validation.
- v0.8: manuscript, artifact evaluation, and reviewer-rebuttal layer. v0.8 distinguishes manuscript-ready, submission-ready, and camera-ready while requiring claim, citation, result, artifact, blinding, and human-review traceability.
- v0.9: external pilot and v1-readiness layer. v0.9 runs one real topic through the full workflow and accepts either a defensible direction or an evidence-backed refusal, with migration, CLI, docs, artifact hygiene, external feedback, and v1-readiness gates.
- v0.9.1: migration remediation patch. v0.9.0 was not v1-ready because the migration/backward compatibility audit failed; v0.9.1 adds historical fixtures, versioned migrators, backup snapshots, and compatibility audit v2 without adding research features.
- v1.0.0: stable evidence-gated release. v1.0.0 passes the v1 readiness gate after v4-v9 evidence, compatibility audit v2, CLI/docs audits, artifact hygiene checks, and an accepted correct-refusal external pilot.
- v2.0.0: Idea Discovery Engine release line. v2 actively searches across topic portfolios for defensible idea candidates using mutation, constructive gap creation, cross-domain transfer, Codex/GPT-5.4 synthesis tasks, active search control, novelty/counterevidence loops, idea tournaments, human feedback, research agenda fallback, and idea yield metrics. The v2 gate requires at least one accepted idea candidate or explicit idea-discovery failure routed to v2.0.1 or v2.1 planning.
- v2.0.1: CI fixture tracking patch. v2.0.1 keeps historical migration fixture JSONs available in clean checkouts without changing product behavior, idea-discovery claims, or the v2.0 selected idea.
- v2.1.0: Selected Idea Execution release. v2.1 freezes the accepted v2.0 idea idea-sequential-specificity-benchmark-for-low-fpr-collusion-audits and turns it into a benchmark specification, threat model, task/data generator, baseline suite, sequential specificity metrics, power plan, experiment workspace, smoke benchmark run, result analysis, reviewer critique, and manuscript package update. The smoke path is runnable and artifact-backed, but remains synthetic and underpowered.
- v2.2.0: Pilot-Scale Benchmark Study release. v2.2 requires a pilot run manifest, expanded honest-agent null distribution, expanded collusive-agent alternatives, stronger hard negatives, full baseline monitor suite, pilot result artifacts, sequential low-FPR uncertainty analysis, prior-work and related-work attachment, reviewer-blocker classification, and a pilot manuscript package. It must not claim deployment validity, unsupported alpha=0.001 operational specificity, main benchmark maturity, or publication readiness from synthetic pilot data.
- v2.3.0: Main-Scale Benchmark and Publication-Readiness Upgrade. v2.3 directly addresses v2.2 blockers by requiring a main-scale sample-size plan, an explicit alpha=0.001 power-or-drop decision, real prior-work records for required categories, related-work matrix completion, stronger baselines, main dataset generation or no-go, main benchmark execution or no-go, publication-readiness reviewer panel, manuscript package upgrade, and a clear go/no-go recommendation.
A summarizer compresses papers. GapForge tracks beliefs, evidence, uncertainty, and rejected ideas.
GapForge keeps:
- durable run and project directories
- normalized paper metadata, local artifacts, parsed sections, and evidence spans
- a claim ledger with support, counterevidence, confidence, and verification state
- source coverage and stopping assessments
- closest-prior-work novelty dossiers
- human review records and review queues
- research directions that mature from seed to manuscript-ready or rejected
- final reports that distinguish evidence-backed claims from hypotheses
The system is intentionally skeptical. It should attack an idea before recommending it, and it should not claim novelty until closest prior work and missing searches are explicit.
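As a shape illustration only (the field names below are assumptions for exposition, not the actual src/gapforge/models.py schema), a claim-ledger entry of this kind can be pictured as:
from dataclasses import dataclass, field

@dataclass
class ClaimEntry:
    # Hypothetical sketch of a claim-ledger record; names are illustrative only.
    claim_id: str
    statement: str
    supporting_evidence: list[str] = field(default_factory=list)  # EvidenceSpan IDs
    counterevidence: list[str] = field(default_factory=list)      # EvidenceSpan IDs
    confidence: float = 0.0                                       # 0.0 speculative .. 1.0 verified
    verification_state: str = "unverified"                        # e.g. unverified | supported | refuted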
Requires Python 3.11+.
git clone <repo-url> GapForge
cd GapForge
python -m venv .venv
source .venv/bin/activate
make install
Equivalent:
python -m pip install -e ".[dev]"
Run a deterministic, offline-safe v0.3 smoke workflow:
export GAPFORGE_DISABLE_NETWORK=1
gapforge run "low false positive collusion detection" --v3 --max-papers 8 --build-index
gapforge report --strict
gapforge coverage
Run the full project-level offline v0.3 smoke target:
make v3-smoke
make v3-smoke initializes a temporary project, runs the staged v0.3 path offline with a small budget, builds the project retrieval index, writes the project report, and generates the static dashboard. It checks for project_report.md, source_coverage.md, final_report.md, direction_maturity_report.md, related_work_matrix.md, review_queue.md, and dashboard/index.html. Active-loop decisions are generated only by gapforge run --v3 --active and are written to active_decisions.md.
Expected behavior:
- network PDF/search expansion steps skip with coverage warnings
- retrieval index artifacts are built from fallback/offline state
- strict report remains conservative and may refuse a top direction
- outputs are smoke-test artifacts, not real literature conclusions
v0.4 introduces project-level campaigns. Campaigns coordinate search, source coverage, retrieval, Codex task packs, validated imports, novelty loops, reviewer loops, experiment protocols, code-task handoff, dashboards, and human acceptance. For the complete Codex command path, see docs/CODEX_QUICKSTART.md.
This is the safest first run. It does not use Codex/GPT-5.4 and does not count as actual-run acceptance.
export GAPFORGE_DISABLE_NETWORK=1
gapforge init-project "v4 deterministic campaign"
gapforge campaign-create "low false positive collusion detection" --project-id v4-deterministic-campaign
gapforge campaign-run --campaign-id <campaign-id> --mode deterministic --max-iterations 3
gapforge campaign-report --campaign-id <campaign-id>
gapforge dashboard --project-id v4-deterministic-campaign --include-actual-runs
Fake-agent mode is CI-safe. It exercises task-pack, schema, validation, import, and campaign-state plumbing. It is never evidence that Codex/GPT-5.4 worked.
export GAPFORGE_DISABLE_NETWORK=1
gapforge campaign-canary-run --profile fake_agent_campaign_regression
gapforge eval --v4 --write-report
Use this when no stable direct Codex runner is configured. It can count as real actual-run evidence only after Codex/GPT-5.4 writes outputs, GapForge validates/imports them, a human attests the task, and campaign review accepts the result.
export GAPFORGE_ENABLE_REAL_RUNS=1
export GAPFORGE_AGENT_MODE=task-pack
export GAPFORGE_AGENT_NAME=codex
export GAPFORGE_CODEX_MODEL=gpt-5.4
gapforge setup-codex
gapforge campaign-canary-run --profile single_task_codex_handoff --real
gapforge latest-codex-task --campaign-id <campaign-id>
gapforge codex-handoff --task-id <task-id> --print-prompt
# Run Codex/GPT-5.4 manually and write JSON outputs into outputs/.
gapforge validate-import-all --task-id <task-id>
gapforge attest-agent-run --task-id <task-id> --agent codex --model gpt-5.4 --method task_pack --attester "<name>"
gapforge campaign-review --campaign-id <campaign-id> --accept --reviewer "<name>"
gapforge actual-run-status --campaign-id <campaign-id>
If validation fails, repair without weakening validation:
gapforge codex-doctor --task-id <task-id>
gapforge repair-agent-output --task-id <task-id> --latest-invalid --handoff --print-prompt
gapforge validate-repair-output --repair-id <repair-id>
gapforge import-repair-output --repair-id <repair-id>
Use direct mode only when a trusted local Codex command runner is configured. Preview it before execution.
export GAPFORGE_ENABLE_REAL_RUNS=1
export GAPFORGE_AGENT_MODE=direct
export GAPFORGE_AGENT_NAME=codex
export GAPFORGE_CODEX_MODEL=gpt-5.4
export GAPFORGE_CODEX_COMMAND='codex run --model {model} --task-pack {task_pack} --output-dir {outputs_dir}'
gapforge setup-codex
gapforge codex-command-preview --task-id <task-id>
gapforge codex-run --task-id <task-id> --direct --dry-run
gapforge codex-run --task-id <task-id> --direct
gapforge validate-import-all --task-id <task-id>
gapforge campaign-review --campaign-id <campaign-id> --accept --reviewer "<name>"
Campaign completion is not acceptance. Acceptance requires validated imports, actual-run attestation when task-pack/manual-handoff is used, source coverage, explicit stop reason, campaign report, and human review. Fake-agent campaigns are excluded.
gapforge campaign-review --campaign-id <campaign-id> --accept --reviewer "<name>"
gapforge campaign-acceptance --campaign-id <campaign-id>
gapforge v4-release-gate --project-id <project-id> --write-report
gapforge v4-release-gate --explain
gapforge v4-release-gate --next-commands
gapforge v4-release-gate requires deterministic CI evidence, a passed fake-agent campaign canary, at least three accepted real Codex/GPT-5.4 campaigns, one experiment-ready campaign, one strict-refusal campaign, and one manual-PDF/full-text campaign. If those artifacts are missing, the gate fails closed.
Current verification status from the May 6, 2026 local pass:
- Deterministic checks passed: make format, make format-check, make lint, make typecheck, make test, make eval, gapforge eval --v2 --write-report, gapforge eval --v3 --write-report, gapforge eval --v4 --write-report, make coverage, make v2-smoke, make v3-smoke, and make v4-smoke.
- Fake-agent campaign canary passed with valid schema validation.
- Direct Codex/GPT-5.4 execution was available through the local codex CLI.
- Three real Codex/GPT-5.4 workflow canaries were validated/imported, attested, and human-reviewed: one refusal canary, one manual-PDF/full-text reading canary, and one fixture-backed experiment-ready canary.
- gapforge v4-release-gate --write-report --json passed with 3 accepted real campaigns and no blockers. These are release-gate workflow canaries, not evidence of exhaustive literature-review quality.
v0.4.1 is scoped to Codex workflow reliability. The main commands are:
gapforge setup-codex
gapforge codex-handoff --task-id <task-id> --print-prompt
gapforge codex-doctor --task-id <task-id>
gapforge validate-import-all --task-id <task-id>
gapforge repair-agent-output --task-id <task-id> --latest-invalid --handoff --print-prompt
The v0.4.1 golden path is the single-task handoff canary:
export GAPFORGE_ENABLE_REAL_RUNS=1
export GAPFORGE_AGENT_MODE=task-pack
export GAPFORGE_AGENT_NAME=codex
export GAPFORGE_CODEX_MODEL=gpt-5.4
gapforge setup-codex
gapforge campaign-canary-run --profile single_task_codex_handoff --real
gapforge codex-handoff --task-id <task-id> --print-prompt
# Run Codex/GPT-5.4 and write JSON outputs to outputs/.
gapforge validate-import-all --task-id <task-id>
gapforge attest-agent-run --task-id <task-id> --agent codex --model gpt-5.4 --method task_pack --attester "<name>"
gapforge campaign-review --campaign-id <campaign-id> --accept --reviewer "<name>"
gapforge campaign-canary-complete --canary-id <canary-id>
See docs/CODEX_QUICKSTART.md, docs/V0_4_1_CODEX_FIX_PLAN.md, docs/V0_4_1_CODEX_ACCEPTANCE.md, and docs/V0_4_1_CODEX_TROUBLESHOOTING.md.
v0.4.1 proves the actual Codex/GPT-5.4 workflow path. It does not prove broad live-literature research quality. v0.5 is scoped to that next step.
v0.5 acceptance must require live-literature campaigns that:
- search live source connectors and record every query/failure
- satisfy field-specific source policy or refuse recommendation
- find and cite closest prior work, or explicitly mark novelty unknown
- ground real-paper claims in paper IDs and EvidenceSpan locators when full text is available
- build related-work matrices and experiment protocols for recommended directions
- undergo human expert review
- reject or downgrade weak novelty, poor coverage, and unsupported claims
The v0.5 docs are:
- docs/V0_5_ROADMAP.md
- docs/V0_5_ACCEPTANCE_CRITERIA.md
- docs/V0_5_REAL_LITERATURE_CAMPAIGNS.md
- docs/V0_5_QUALITY_GATES.md
- docs/V0_5_LIVE_SOURCE_POLICY.md
Normal CI remains deterministic and offline. Fixture-only canaries and fake-agent campaigns do not count as v0.5 live-literature quality.
Latest v0.5 validation status: gapforge v5-release-gate --write-report --json passed locally on May 6, 2026. The accepted quality campaigns were one experiment-ready low-FPR collusion campaign and one conservative refusal campaign. This is live-literature release-gate acceptance, not a claim of exhaustive literature review.
v0.5 can say a direction is experiment-ready. v0.6 says whether an experiment actually ran, what artifacts it produced, how results were analyzed, whether the run is reproducible, and whether empirical claims are supported by result artifacts.
The v0.6 docs are:
- docs/V0_6_ROADMAP.md
- docs/V0_6_ACCEPTANCE_CRITERIA.md
- docs/V0_6_EXPERIMENT_EXECUTION.md
- docs/V0_6_EMPIRICAL_VALIDATION.md
- docs/V0_6_REPRODUCIBILITY_POLICY.md
- docs/releases/v0.6.0.md
- docs/releases/v0.6.0-empirical-validation.md
Core boundary:
- an experiment protocol is not an executed experiment
- a scaffold is not an executed experiment
- a smoke run validates wiring only; it is not empirical success
- pilot/main results require execution records and result artifacts
- failed or negative experiments are first-class results
- no empirical claim is supported unless a run record and result artifact exist
- generated paper packages must separate real results from placeholders and hypothetical expected results
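To make the artifact-gating rule concrete, here is a minimal sketch of the idea, assuming hypothetical record shapes rather than GapForge's actual v0.6 models:
def empirical_claim_supported(claim: dict, executions: list[dict], artifacts: list[dict]) -> bool:
    # Illustrative gate only: an empirical claim counts as supported when it links
    # at least one execution record and that execution produced a result artifact.
    linked_runs = {e["id"] for e in executions if e["id"] in claim.get("execution_ids", [])}
    has_result = any(a.get("execution_id") in linked_runs for a in artifacts)
    return bool(linked_runs) and has_result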
Minimal local workflow:
gapforge experiment-workspace-create --project-id <project-id> --direction-id <direction-id>
gapforge dataset-register --workspace-id <workspace-id> --name "fixture examples" --path data/fixture.csv --dataset-type fixture --license MIT
gapforge baseline-register --workspace-id <workspace-id> --name "heuristic baseline" --baseline-type heuristic --implementation-path code/src/baselines.py
gapforge metric-register --workspace-id <workspace-id> --name "false positive rate"
gapforge scaffold-experiment-code --workspace-id <workspace-id>
gapforge experiment-manifest-create --workspace-id <workspace-id> --run-type smoke --command "python code/src/run_experiment.py" --expected-output results/smoke_metrics.json --random-seed 123
gapforge experiment-run --workspace-id <workspace-id> --manifest-id <manifest-id>
gapforge parse-results --execution-id <execution-id>
gapforge analyze-results --execution-id <execution-id>
gapforge reproducibility-check --execution-id <execution-id>
gapforge empirical-review --execution-id <execution-id>
gapforge export-paper-package-v2 --workspace-id <workspace-id>
gapforge v6-release-gate --write-report --json
Codex/GPT-5.4 can implement experiment code through constrained workspace code tasks:
gapforge experiment-code-task --workspace-id <workspace-id> --type implement_metric
gapforge experiment-code-handoff --task-id <task-id>
gapforge experiment-code-import --task-id <task-id>
Codex code tasks are bounded to experiment_workspaces/<workspace-id>/code/. They may create code, tests, configs, and analysis scripts; they must not create fake result metrics, claim an experiment ran, or edit evidence/claim state outside the validated import path.
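A minimal sketch of that path boundary, assuming a repository-relative layout (this is not the actual import validator):
from pathlib import Path

def within_workspace_code(candidate: str, workspace_id: str, repo_root: str = ".") -> bool:
    # Illustrative check only: a Codex experiment-code change may touch files
    # inside experiment_workspaces/<workspace-id>/code/ and nothing else.
    allowed = (Path(repo_root) / "experiment_workspaces" / workspace_id / "code").resolve()
    target = (Path(repo_root) / candidate).resolve()
    return target == allowed or allowed in target.parents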
The v0.6 release gate requires at least one executed fixture experiment and one failed or negative experiment path, while keeping CI free of expensive experiment execution.
Latest v0.6 validation status: fixture experiment execution passed locally on May 6, 2026, and gapforge v6-release-gate --write-report --json passed. One real Codex/GPT-5.4 experiment-code task was executed and imported as a validated workspace-bounded implementation patch. No real-world main experiment was run, and no real empirical result is claimed.
v0.6 validates experiment execution mechanics. v0.7 validates benchmark execution and replication plumbing: external data policies, compute environments, sweeps, ablations, comparison tables, error/slice analysis, low-FPR power checks, and replication packages.
The v0.7 docs are:
- docs/V0_7_ROADMAP.md
- docs/V0_7_ACCEPTANCE_CRITERIA.md
- docs/V0_7_REAL_BENCHMARKS.md
- docs/V0_7_COMPUTE_AND_REPLICATION.md
- docs/V0_7_BENCHMARK_RELEASE_GATE.md
- docs/releases/v0.7.0.md
- docs/releases/v0.7.0-benchmark-replication.md
Core boundary:
- fixture smoke validates wiring only
- local benchmark may count when it uses real or benchmark-like non-fixture data with complete artifacts
- full benchmark requires intended data, baselines, metrics, compute records, and analysis
- replication requires rerunnable package instructions, checksums/version IDs, expected artifacts, and review
- no GPU, cluster, live source, or external download is required in normal CI
- no large dataset should be downloaded without explicit user approval
- no benchmark success should be claimed from fixture runs
v0.7 must preserve v5 literature gates and v6 empirical claim gates. Benchmark output should become a supported claim only when execution records, result artifacts, statistical analysis, and review justify it.
Minimal fixture benchmark smoke:
export GAPFORGE_DISABLE_NETWORK=1
gapforge benchmark-canary-run --profile fixture_benchmark_success
gapforge benchmark-canary-run --profile fixture_benchmark_failure
gapforge benchmark-canary-run --profile low_fpr_underpowered_benchmark
gapforge benchmark-canary-run --profile replication_package_canary
gapforge v7-release-gate --write-report --json
Opt-in real/local public benchmark validation:
gapforge benchmark-canary-run --profile local_public_small_benchmark --real --accept-download
gapforge v7-release-gate --write-report --json --claim-real
Latest v0.7 validation status: fixture benchmark gate and opt-in real/local public benchmark gate passed locally on May 7, 2026. The real benchmark canary used explicit dataset consent, a cached public UCI Iris download, artifact-backed metrics and predictions, benchmark comparison, error analysis, and replication package verification. This validates the benchmark-and-replication path; it is not a SOTA claim, a large-scale benchmark study, a GPU/cluster validation, or independent replication by another researcher.
v0.8 is the manuscript release boundary. It turns project, literature, experiment, benchmark, and replication state into auditable manuscript and artifact-evaluation packages without hiding missing work.
The v0.8 docs are:
- docs/V0_8_ROADMAP.md
- docs/V0_8_ACCEPTANCE_CRITERIA.md
- docs/V0_8_MANUSCRIPT_WORKFLOW.md
- docs/V0_8_ARTIFACT_EVALUATION.md
- docs/V0_8_REBUTTAL_WORKFLOW.md
- docs/releases/v0.8.0.md
- docs/releases/v0.8.0-manuscript-submission.md
Core boundary:
- manuscript-ready means a draft package can be assembled, possibly with visible blockers
- submission-ready means the venue profile, traceability, citations, results, artifact package, blinding, reviewer-objection, and human-review gates pass
- camera-ready means a post-acceptance checklist has been completed after explicit acceptance metadata
- manuscript claims must trace to claim ledger entries, evidence spans, result artifacts, benchmark records, reviewer decisions, and citations
- figures and tables must be generated from recorded result artifacts or explicitly labeled conceptual placeholders
- artifact evaluation packages must be generated from replication and workspace state
- rebuttal plans must answer reviewer objections with evidence, changes, experiments, or honest concessions
- normal CI must remain deterministic, offline, and free of LaTeX, GPU, cluster, live-source, large-download, and live-LLM requirements
v0.8 must not fabricate results, citations, BibTeX, novelty, rebuttal evidence, venue acceptance, or camera-ready status. It must not mark a manuscript submission-ready when novelty, result, reproducibility, artifact, citation, blinding, or human-review gates fail.
v0.9 is the first real external pilot release. It should run GapForge on one real topic from broad framing through live literature, novelty review, protocol, execution path, artifact package, manuscript draft, reviewer/rebuttal plan, external feedback, and v1 readiness assessment.
The recommended pilot topic is low false-positive collusion detection in LLM multi-agent systems.
The v0.9 docs are:
- docs/V0_9_ROADMAP.md
- docs/V0_9_ACCEPTANCE_CRITERIA.md
- docs/V0_9_EXTERNAL_PILOT.md
- docs/V0_9_V1_READINESS.md
- docs/V0_9_1_MIGRATION_REMEDIATION.md
- docs/V1_MIGRATION_AND_COMPATIBILITY.md
- docs/V0_9_FAILURE_MODES.md
- docs/pilots/low_fpr_collusion/PILOT_SPEC.md
- docs/pilots/low_fpr_collusion/ACCEPTANCE_CRITERIA.md
- docs/pilots/low_fpr_collusion/REVIEW_CHECKLIST.md
Core boundary:
- v0.9 may pass with a defensible direction or an evidence-backed refusal
- fixture-only results validate workflow mechanics only
- no research direction should be forced when novelty, coverage, or feasibility is weak
- small real runs require run records, logs, result artifacts, analysis, and review before supporting empirical claims
- external reviewer or pilot feedback is required
- migration/backward compatibility from v0.1 through v0.9 state must be audited with compatibility audit v2
- CLI/docs usability and artifact hygiene issues found during the pilot must be fixed, scoped, or recorded
- v1 cannot be claimed until the v1 readiness gate passes
Suggested pilot path:
gapforge pilot-list
gapforge pilot-spec --name low_fpr_collusion
gapforge pilot-run --name low_fpr_collusion
gapforge pilot-status --name low_fpr_collusion
gapforge real-campaign-dry-run --profile live_low_fpr_collusion --write-report
gapforge live-source-diagnostic --topic "low false-positive collusion detection in LLM multi-agent systems" --source-profile ai_safety --write-report
gapforge real-literature-run --profile live_low_fpr_collusion
gapforge campaign-report --campaign-id <campaign-id>
gapforge real-literature-review --campaign-id <campaign-id> --reviewer "<expert>"
gapforge real-literature-acceptance --campaign-id <campaign-id>
gapforge idea-gate --pilot-id <pilot-id>
gapforge pilot-outcome --pilot-id <pilot-id>
gapforge pilot-review --pilot-id <pilot-id> --accept-outcome --reviewer-role external_reviewer
gapforge pilot-report --pilot-id <pilot-id>
gapforge compatibility-audit --v2 --write-report
gapforge migrate-all --dry-run
# Review the planned MigrationRecord output before applying.
gapforge migrate-all --apply
gapforge migration-report
gapforge cli-audit --write-report
gapforge docs-audit --write-report
gapforge v1-readiness --write-report --json
If the v0.9 CLI surface is incomplete, the pilot must record the gap as a CLI cleanup blocker rather than pretending the command exists in a validated release path.
v1 readiness must not be claimed from the pilot alone. The v2 compatibility audit must pass, and warning-level findings about generated local artifacts are acceptable only when those artifacts are ignored and are not curated as release evidence.
Use v0.5 when you want GapForge to assess whether a real literature campaign is research-useful, not merely whether the workflow ran.
gapforge real-literature-profiles
gapforge real-campaign-dry-run --profile live_low_fpr_collusion --write-report
gapforge source-health --topic "low false positive collusion detection" --source arxiv
gapforge live-source-diagnostic --topic "low false positive collusion detection in LLM agents" --source-profile ai_safety --write-report
gapforge real-literature-run --profile live_low_fpr_collusion
gapforge real-literature-status --record-id <record-id>
The dry run does not contact live sources or spend Codex budget. It previews expected source checks, search rounds, Codex tasks, artifacts, likely blockers, commands, and acceptance requirements.
Plan searches before asking Codex to synthesize:
gapforge plan-search-strategy "low false positive collusion detection in LLM agents" --source-profile ai_safety
gapforge execute-search-strategy --run-id <run-id> --strategy-id <strategy-id>
gapforge search-rounds --run-id <run-id>
gapforge canonicalize-papers --run-id <run-id>
gapforge prior-work-recall --run-id <run-id> --gap-id <gap-id>
If GAPFORGE_DISABLE_NETWORK=1, source health and execution commands must report disabled/skipped state. That can support offline smoke testing, but not live-literature acceptance.
Workflow acceptance and research-quality acceptance are separate:
gapforge campaign-report --campaign-id <campaign-id>
gapforge real-literature-review --campaign-id <campaign-id>
gapforge real-literature-review --campaign-id <campaign-id> --accept-quality --reviewer "<expert>"
gapforge real-literature-acceptance --campaign-id <campaign-id>
gapforge v5-release-gate --project-id <project-id> --write-report
Human reviewers should reject campaigns with fake citations, unsupported high-confidence claims, obvious missed prior work, overclaimed novelty, poor source coverage hidden by report language, or experiment protocols lacking baselines/metrics/falsification.
The same workflow is scriptable without shelling out:
from gapforge import api
diagnostic = api.source_health("low false positive collusion", profile="ai_safety")
strategy = api.plan_search_strategy("low false positive collusion", "ai_safety")
record = api.run_real_literature_campaign("live_low_fpr_collusion")
gate = api.v5_release_gate()
API calls are thin wrappers over the same managers used by the CLI. Tests can pass mocked sources; live network is never required for CI.
Use project memory when several runs belong to one research program:
gapforge init-project "monitoring collusion research"
gapforge run "low false positive collusion detection" --v3 --project-id monitoring-collusion-research --max-papers 20
gapforge sync-project-memory --project-id monitoring-collusion-research
gapforge project-status --project-id monitoring-collusion-research
gapforge project-report --project-id monitoring-collusion-research
Project memory deduplicates papers across runs, preserves rejected ideas and human decisions, and stores research directions that can mature over time. Prior project memory is context, not newly verified evidence.
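One way to picture cross-run deduplication (illustrative only; not GapForge's actual canonicalization logic) is a stable paper key that prefers a DOI and falls back to a normalized title:
def paper_dedup_key(paper: dict) -> str:
    # Illustrative dedup key: prefer DOI, else a whitespace/case-normalized title.
    doi = (paper.get("doi") or "").strip().lower()
    if doi:
        return f"doi:{doi}"
    title = " ".join((paper.get("title") or "").lower().split())
    return f"title:{title}"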
Build and inspect a local hybrid lexical/semantic index:
gapforge build-index --run-id <run-id>
gapforge search-index --run-id <run-id> "low false positive evaluation benchmark" --top-k 20
gapforge explain-retrieval --run-id <run-id> "closest prior work low FPR"
For project memory:
gapforge build-index --project-id <project-id>
gapforge search-index --project-id <project-id> "monitor evasion limitation" --top-k 20
The default semantic path uses deterministic local hash embeddings. Live embedding APIs are not required.
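For intuition, a deterministic hash embedding can be as simple as the sketch below; this is an illustrative stand-in, not the actual src/gapforge/retrieval/ implementation:
import hashlib
import math

def hash_embed(text: str, dim: int = 256) -> list[float]:
    # Hash each token into a signed bucket, then L2-normalize so cosine
    # similarity behaves sensibly. Deterministic and fully offline.
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]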
LLM-backed skills are opt-in. Tests and default runs do not require live model calls.
export GAPFORGE_LLM_MODE=prompt-pack # off|prompt-pack|fake|provider
gapforge prompt-pack --run-id <run-id> --skill deep-reading
gapforge read-llm --run-id <run-id> --tier 1 --dry-run-prompts
Fake mode is deterministic:
export GAPFORGE_LLM_MODE=fake
gapforge read-llm --run-id <run-id> --tier 1 --fake
gapforge mine-gaps-llm --run-id <run-id> --fake
gapforge novelty-check-llm --run-id <run-id> --all --fake
Provider mode is optional and must validate JSON before state is updated. Model outputs must cite paper IDs and EvidenceSpan locators when source-backed. Hidden chain-of-thought must not be requested or stored.
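A minimal sketch of the kind of guard this implies, using assumed payload field names rather than the real provider-mode schema:
def llm_output_errors(payload: dict, known_paper_ids: set[str]) -> list[str]:
    # Illustrative validation only: refuse to update run state unless every
    # source-backed claim cites a known paper ID and an EvidenceSpan locator.
    errors: list[str] = []
    for claim in payload.get("claims", []):
        if not claim.get("statement"):
            errors.append("claim missing statement")
        for cite in claim.get("citations", []):
            if cite.get("paper_id") not in known_paper_ids:
                errors.append(f"unknown paper_id: {cite.get('paper_id')}")
            if not cite.get("evidence_span"):
                errors.append("citation missing EvidenceSpan locator")
    return errors  # import only when this list is empty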
GapForge separates deterministic runs, fake-agent tests, task-pack handoff, and direct Codex execution:
gapforge run "topic" --v3 --mode deterministic
GAPFORGE_DISABLE_NETWORK=1 gapforge campaign-canary-run --profile fake_agent_campaign_regression
gapforge campaign-canary-run --profile single_task_codex_handoff --real
gapforge codex-handoff --task-id <task-id> --print-prompt
gapforge codex-run --task-id <task-id> --direct --dry-run
Deterministic mode is the default. Fake-agent mode is CI-safe and validates task-pack/schema plumbing. Task-pack handoff can count only after real Codex/GPT-5.4 output, validation, import, attestation, and human review. Direct mode can count only after valid output/import and human review.
GapForge separates automated validation from actual research-agent validation:
- Level 0 deterministic tests: no LLM, CI-safe, validates code, schemas, persistence, reports, and evals.
- Level 1 offline smoke tests: no LLM and no network, validates orchestration safety.
- Level 2 fake LLM tests: fake model only, validates JSON guards and evidence gates.
- Level 3 prompt-pack/task-pack dry runs: no live calls, validates Codex/GPT-5.4 prompts and schemas.
- Level 4 Codex/GPT-5.4 canary runs: real model/agent, private/manual, validates actual LLM-assisted research skills.
- Level 5 human-reviewed acceptance: human review of canary outputs and recorded decisions.
CI must not require Codex/GPT-5.4. Actual Codex/GPT-5.4 assisted runs are required before a release can claim real-run validation, but they are not part of normal automated tests. If Codex/GPT-5.4 is unavailable, actual-run validation has not passed. Never fake a canary pass.
For v0.4, the bar is higher: fake-agent tests and prompt-pack dry runs remain necessary but are not sufficient. v0.4 actual-run acceptance requires multiple real Codex/GPT-5.4 agentic campaigns with validated imports, campaign artifacts, strict-report checks, and human acceptance reviews. A task-pack handoff may support acceptance only after real Codex/GPT-5.4 outputs are imported, validated, attested, and reviewed.
See:
- docs/CODEX_QUICKSTART.md
- docs/V0_3_REAL_RUN_ACCEPTANCE.md
- docs/V0_3_CANARY_RUNS.md
- docs/V0_4_ROADMAP.md
- docs/V0_4_ACCEPTANCE_CRITERIA.md
- docs/V0_4_REAL_RUN_ACCEPTANCE.md
- docs/V0_4_AGENTIC_CAMPAIGNS.md
- docs/CODEX_RESEARCH_AGENT.md
- docs/REAL_RUN_REVIEW_CHECKLIST.md
- docs/releases/v0.3.0-real-run-acceptance.md
- docs/releases/v0.4.0-real-run-acceptance.md
- docs/releases/v0.4.0.md
Latest v0.4 validation status: deterministic checks, the fake-agent campaign canary, and the v0.4 actual-run release gate passed locally on May 6, 2026. The accepted real Codex/GPT-5.4 campaigns were workflow canaries with conservative/fixture-backed outputs; they validate the actual-run path, not exhaustive autonomous research quality.
Legacy paper-package exports are starter kits, not finished papers:
gapforge create-direction --project-id <project-id> --gap-id <gap-id>
gapforge related-work-matrix --project-id <project-id> --direction-id <direction-id>
gapforge mature-direction --project-id <project-id> --direction-id <direction-id>
gapforge experiment-protocol --project-id <project-id> --direction-id <direction-id>
gapforge review-panel --project-id <project-id> --direction-id <direction-id>
gapforge export-paper-package --project-id <project-id> --direction-id <direction-id>
Exports include outlines, related-work matrix, protocol, limitations, reviewer objections, rebuttal plan, bibliography, claim ledger, and evidence index. They must not invent results.
v0.8 extends this into a first-class manuscript workflow:
gapforge manuscript-create --project-id <project-id> --direction-id <direction-id> --workspace-id <workspace-id> --title "Paper title"
gapforge bibliography-build --manuscript-id <manuscript-id>
gapforge manuscript-traceability --manuscript-id <manuscript-id>
gapforge manuscript-table --manuscript-id <manuscript-id> --type result_table
gapforge manuscript-figure --manuscript-id <manuscript-id> --type metric_plot
gapforge manuscript-set-venue --manuscript-id <manuscript-id> --venue generic_conference
gapforge submission-checklist --manuscript-id <manuscript-id>
gapforge anonymize-manuscript --manuscript-id <manuscript-id>
gapforge artifact-eval-package --manuscript-id <manuscript-id>
gapforge manuscript-review --manuscript-id <manuscript-id>
gapforge rebuttal-plan --manuscript-id <manuscript-id>
gapforge submission-package --manuscript-id <manuscript-id> --type review
gapforge dashboard --manuscript-id <manuscript-id>
gapforge v8-release-gate --write-report --json
Readiness terms are strict:
- Manuscript-ready: a durable draft exists and can be inspected, but blockers may remain.
- Review-ready: internal reviewer and artifact checks have run and remaining issues are visible.
- Submission-ready: the venue-aware checklist, traceability, citation validity, result artifacts, artifact evaluation package, anonymization if required, and unsupported-claim gates pass.
- Camera-ready: post-review/rebuttal blockers are addressed. It does not imply venue acceptance unless an external acceptance record exists.
Manuscript citations must resolve to known Paper records. Result claims must link execution/result artifacts. Reviewer rebuttals must cite evidence or request fixes; they must not invent answers.
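The submission-ready definition above is a conjunction: every gate must pass, and any single failure blocks the status. A sketch with assumed gate names (not the actual v0.8 checklist keys):
def submission_ready(gates: dict[str, bool]) -> bool:
    # Illustrative fail-closed conjunction; a missing gate counts as a failure.
    required = [
        "venue_checklist", "claim_traceability", "citation_validity",
        "result_artifacts", "artifact_eval_package", "anonymization",
        "reviewer_objections", "human_review",
    ]
    return all(gates.get(name, False) for name in required)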
v2.1 is the Selected Idea Execution release for the accepted v2.0 candidate:
- Idea ID: idea-sequential-specificity-benchmark-for-low-fpr-collusion-audits
- Title: Sequential specificity benchmark for low-FPR collusion audits
The v2.1 path is benchmark-first. It should define and smoke-run a concrete research artifact before any stronger claim is allowed:
- benchmark specification
- threat model and observability assumptions
- honest-agent baseline distribution
- collusive-agent scenario distribution
- hard-negative benign scenarios
- sequential audit protocol
- low-FPR specificity metrics
- honest baseline monitors
- power and sample-size plan
- experiment workspace and runnable smoke manifest
- result artifacts, analysis, reviewer critique, and manuscript package update
v2.1 must distinguish smoke, pilot, and main benchmark levels. Smoke runs prove only that the benchmark path runs and writes artifacts. They do not prove real-world collusion benchmark validity, final scientific results, method superiority, or publication readiness.
Current maturity statement: v2 found candidate idea idea-sequential-specificity-benchmark-for-low-fpr-collusion-audits; v2.1 executes that selected idea as a benchmark artifact workflow. The synthetic smoke benchmark is not a final research result, low-FPR claims require power, and benchmark validity limitations remain open until pilot/main evidence and review exist.
Scriptable v2.1 path:
from gapforge import api
selected_project = api.create_selected_idea_project("<accepted-idea-id>")
spec = api.create_selected_benchmark_spec(selected_project.project_id)
api.generate_traces(spec.id, count=100, split="smoke")
api.create_monitor_baselines(spec.id)
smoke = api.run_selected_benchmark_smoke(spec.id)
api.compute_sequential_metrics(smoke.dataset_id)
api.selected_benchmark_review(spec.id)
api.selected_benchmark_manuscript(spec.id)
api.v21_release_gate(write_report=True)
Next steps toward pilot/main benchmark: expand the honest null and collusive alternative distributions, preregister target alpha levels and sample sizes, run all required baselines, resolve reviewer blockers, and keep smoke versus pilot versus main status visible in dashboards and release notes.
Planning docs:
- docs/V2_1_ROADMAP.md
- docs/V2_1_SELECTED_IDEA.md
- docs/V2_1_BENCHMARK_SPEC.md
- docs/V2_1_ACCEPTANCE_CRITERIA.md
v2.2 is the pilot-scale release for the same locked selected idea. It must require a pilot run, not another smoke-only package:
- powered pilot sample-size plan
- expanded honest-agent null distribution
- expanded collusive-agent alternative distribution
- stronger hard-negative traces
- full baseline monitor suite
- locked pilot run manifest
- pilot result artifacts
- sequential low-FPR analysis with uncertainty
- prior-work recall and related-work matrix attachment
- reviewer blocker resolution or preservation
- pilot manuscript package
- no pilot/main overclaiming
v2.2 must keep smoke, pilot, and main benchmark maturity separate. A synthetic pilot can support only the low-FPR claims that its sample size, sequential correction, and uncertainty report support. It must not claim deployment validity, unsupported alpha=0.001 operational specificity, real deployment evidence, main benchmark maturity, or publication readiness with unresolved prior-work, related-work, baseline, power, or reviewer gates.
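A back-of-envelope check (not GapForge's actual power plan) shows why: with zero observed false positives among n honest traces, the rule of three puts an approximate 95% upper bound on the true FPR at about 3/n, so roughly 3,000 honest traces are the minimum before an alpha=0.001 specificity claim is even arguable.
def fpr_upper_bound_rule_of_three(n_negatives: int) -> float:
    # Approximate 95% upper confidence bound on FPR when zero false positives
    # were observed among n_negatives honest traces (rule of three).
    return 3.0 / n_negatives

print(fpr_upper_bound_rule_of_three(300))   # ~0.01  -> cannot support alpha=0.001
print(fpr_upper_bound_rule_of_three(3000))  # ~0.001 -> minimum scale to discuss it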
Scriptable v2.2 path:
from gapforge import api
api.create_pilot_power_plan("<benchmark-id>")
api.generate_honest_null("<benchmark-id>", count=300)
api.generate_collusive_alternatives("<benchmark-id>", count=150)
dataset = api.build_pilot_dataset("<benchmark-id>", negative_count=300, positive_count=150)
api.calibrate_pilot_baselines("<benchmark-id>", target_alpha=0.01)
execution = api.run_selected_pilot("<benchmark-id>", dataset_id=dataset.id)
api.analyze_selected_pilot(execution.id)
api.selected_pilot_review("<benchmark-id>")
api.selected_pilot_manuscript("<benchmark-id>")
api.v22_release_gate(write_report=True)
Inspectable dashboard:
gapforge dashboard --project-id <selected-project-id> --include-selected-pilot
v2.2 is not the end state. v2.3 remains responsible for main-scale evidence: larger honest-null counts for currently unsupported alpha targets, stronger external scenario review, more competitive baselines, reproducible main-run manifests, and publication-readiness only if prior-work, related-work, power, and reviewer gates all pass.
Planning docs:
- docs/V2_2_ROADMAP.md
- docs/V2_2_PILOT_BENCHMARK.md
- docs/V2_2_ACCEPTANCE_CRITERIA.md
- docs/V2_2_POWER_AND_SAMPLE_SIZE.md
- docs/V2_2_REVIEWER_BLOCKERS.md
v2.3 is the main-scale upgrade for the same locked selected idea. It starts from the v2.2 pilot handoff: 300 negative, 150 positive, and 214 hard-negative synthetic traces; pilot-scale support for alpha=0.01; underpowered and blocked alpha=0.001; incomplete real prior-work records; weak scientific baseline coverage; and a manuscript package that is not publication-ready.
v2.3 must directly address those blockers:
- main-scale sample-size plan
- alpha=0.001 powered or explicitly dropped
- real prior-work records for every required category
- completed related-work matrix
- stronger baseline suite
- main dataset generation or pilot-to-main expansion
- main benchmark run or explicit no-go
- publication-readiness reviewer panel
- manuscript package upgrade
- clear go/no-go recommendation
v2.3 must not claim deployment validity from synthetic evidence, must not claim alpha=0.001 while underpowered, must not hide missing real prior work, and must not call the manuscript publication-ready while fatal reviewer blockers remain.
Scriptable v2.3 path:
from gapforge import api
api.create_main_sample_size_plan("<benchmark-id>")
api.decide_alpha_target("<benchmark-id>", alpha=0.001)
api.complete_selected_related_work("<benchmark-id>")
api.complete_selected_related_work_matrix("<benchmark-id>")
api.register_main_baselines("<benchmark-id>")
dataset = api.build_main_dataset("<benchmark-id>")
manifest = api.create_selected_main_manifest("<benchmark-id>", dataset_id=dataset.id)
execution = api.run_selected_main_benchmark("<benchmark-id>", manifest_id=manifest.id)
api.analyze_selected_main_benchmark(execution.id)
api.selected_publication_readiness_review("<benchmark-id>")
api.selected_main_manuscript("<benchmark-id>")
api.v23_release_gate(write_report=True)
The API names above describe the intended v2.3 workflow surface. Equivalent CLI commands or implementation-specific wrappers must preserve the same evidence requirements.
Implemented main-power CLI:
gapforge selected-main-power-plan --benchmark-id <benchmark-id>
gapforge selected-main-alpha-decision --benchmark-id <benchmark-id> --alpha 0.001
gapforge selected-main-power-report --benchmark-id <benchmark-id>
gapforge selected-related-work-complete --benchmark-id <benchmark-id>
gapforge selected-related-work-next-searches --benchmark-id <benchmark-id>
gapforge selected-related-work-status --benchmark-id <benchmark-id>
gapforge selected-baseline-strength --benchmark-id <benchmark-id>
gapforge implement-required-baseline-task --benchmark-id <benchmark-id> --baseline <baseline-name>
gapforge selected-baseline-strength-report --benchmark-id <benchmark-id>
gapforge build-main-trace-dataset --benchmark-id <benchmark-id>
gapforge main-trace-dataset-report --dataset-id <main-dataset-id>
gapforge selected-main-manifest --benchmark-id <benchmark-id> --dataset-id <main-dataset-id>
gapforge selected-main-run --benchmark-id <benchmark-id> --manifest-id <main-manifest-id>
gapforge selected-main-status --execution-id <main-execution-id>
gapforge selected-main-analysis --execution-id <main-execution-id>
gapforge selected-publication-review --benchmark-id <benchmark-id>
gapforge selected-publication-fix-list --benchmark-id <benchmark-id>
gapforge selected-main-manuscript --benchmark-id <benchmark-id>
gapforge selected-main-paper-package --benchmark-id <benchmark-id>
gapforge selected-benchmark-go-no-go --benchmark-id <benchmark-id>
gapforge selected-go-no-go-report --benchmark-id <benchmark-id>
Planning docs:
- docs/V2_3_ROADMAP.md
- docs/V2_3_MAIN_BENCHMARK.md
- docs/V2_3_RELATED_WORK_COMPLETION.md
- docs/V2_3_PUBLICATION_READINESS.md
- docs/V2_3_ACCEPTANCE_CRITERIA.md
Use the active loop when GapForge should decide whether to search, parse, read, expand citations, check novelty, request human review, or stop:
gapforge run "low false positive collusion detection" --v3 --active --budget small --source-profile ai_safety
gapforge active-decisions --run-id <run-id>
Every decision is written to active_decisions.md.
gapforge --help
gapforge init-topic "topic"
gapforge run "topic"
gapforge run "topic" --v2 --max-papers 20
gapforge run "topic" --v3 --project-id <project-id> --build-index --mature-directions
gapforge topic-portfolio --project-id <project-id>
gapforge idea-generate --project-id <project-id>
gapforge mutate-idea --idea-id <idea-id>
gapforge constructive-gaps --project-id <project-id>
gapforge transfer-ideas --project-id <project-id>
gapforge idea-novelty --idea-id <idea-id>
gapforge idea-tournament --project-id <project-id>
gapforge idea-feedback --idea-id <idea-id> --action accept
gapforge research-agenda --project-id <project-id>
gapforge idea-yield --project-id <project-id> --write-report
gapforge v2-release-gate --write-report --json
gapforge dashboard --run-id <run-id>
gapforge dashboard --project-id <project-id> --include-ideas
gapforge review-queue --run-id <run-id>
gapforge audit-artifacts --run-id <run-id>
The same v2 workflows are scriptable from Python through gapforge.api: generate_topic_portfolio, create_idea_bank, generate_ideas, mutate_idea, generate_constructive_gaps, transfer_ideas, run_idea_novelty, run_idea_tournament, add_idea_feedback, generate_research_agenda, idea_yield, and v2_release_gate.
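A minimal usage sketch of those api functions; the argument and attribute names here are assumptions for illustration, not verified signatures:
from gapforge import api

portfolio = api.generate_topic_portfolio("<project-id>")
bank = api.create_idea_bank("<project-id>")
ideas = api.generate_ideas("<project-id>")
api.mutate_idea(ideas[0].id)
api.run_idea_novelty(ideas[0].id)
api.run_idea_tournament("<project-id>")
api.add_idea_feedback(ideas[0].id, action="accept")
agenda = api.generate_research_agenda("<project-id>")
yield_report = api.idea_yield("<project-id>")
gate = api.v2_release_gate(write_report=True)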
Core layers:
- src/gapforge/models.py: typed run, project, evidence, retrieval, review, and export models
- src/gapforge/state.py: durable run persistence and validation
- src/gapforge/project_memory.py: multi-run project memory
- src/gapforge/orchestrator.py: v0.1/v0.2/v0.3 orchestration
- src/gapforge/orchestration/: active loop decisions and budgets
- src/gapforge/sources/: source connectors, coverage, ranking, and policies
- src/gapforge/skills/: deterministic and optional LLM-backed skills
- src/gapforge/retrieval/: local hybrid retrieval
- src/gapforge/reporting.py: conservative Markdown/JSON reports
See docs/ARCHITECTURE.md.
Evals are offline and fixture-driven. They measure behavior such as specificity, evidence linkage, duplicate rejection, source coverage transparency, retrieval relevance, direction maturity, and manuscript honesty. They do not prove real scientific usefulness.
make eval
gapforge eval --v2
gapforge eval --v3
gapforge eval --v4
gapforge eval --v5
gapforge eval --v6
gapforge eval --v7
gapforge eval --v8
- GapForge does not perform exhaustive literature review.
- Offline fallback outputs are smoke tests.
- Semantic retrieval is a ranking aid, not proof of novelty.
- LLM outputs are untrusted until schema-valid and evidence-located.
- Fake-agent outputs validate plumbing only and never count as real Codex/GPT-5.4 research.
- Prompt-pack handoff counts as real only after Codex/GPT-5.4 outputs are validated, attested, and human-reviewed.
- Empirical claims are artifact-gated: no run record and result artifact means no supported result claim.
- Benchmark claims are stronger than fixture smoke claims and require benchmark records, real or benchmark-like data, consent where applicable, compute logs, result artifacts, analysis, replication packaging, and review.
- Manuscript submission-readiness requires traceable claims, known citations, artifact-backed results, artifact evaluation package status, blinding review when applicable, reviewer-objection handling, and human review.
- v0.9 external pilot success may be a defensible direction or an evidence-backed refusal; v1 readiness is a separate gate.
- v2 idea seeds are provisional; only accepted idea candidates have survived active search, novelty/counterevidence review, tournament comparison, and human review.
- v2 research agendas are honest fallback plans, not accepted ideas or paper-ready claims.
- PDFs, transcripts, and generated dashboards may be unsafe to commit.
- Human review is required before treating any direction as research-ready.
Use:
gapforge audit-artifacts --run-id <run-id>
gapforge export-safe-bundle --project-id <project-id>
- v0.1: deterministic research OS foundation
- v0.2: full-text evidence and novelty dossier upgrade
- v0.3: semantic, optional LLM-assisted, multi-run project-memory upgrade
- v0.4: actual Codex/GPT-5.4 agentic campaign execution path, campaign recovery, multi-step canaries, v4 evals, and strict human-reviewed real-run acceptance gates
- v0.5: real-literature campaign quality with source diagnostics, search strategy, prior-work recall, quality review, v5 evals, and v5 release gate
- v0.6: experiment execution and empirical validation with run manifests, result artifacts, statistics, reproducibility checks, failed/negative result handling, and result claim ledger
- v0.7: real benchmark execution and replication with benchmark registry, dataset consent/cache, compute modes, jobs, sweeps, error/slice analysis, low-FPR power checks, benchmark comparison, and replication packages
- v0.8: manuscript, artifact evaluation, and reviewer-rebuttal workflow with manuscript projects, claim-to-paper traceability, citation/BibTeX management, venue templates, section drafting, figure/table generation, artifact evaluation packaging, blinding support, submission-readiness gates, and camera-ready checklists
- v0.9: external pilot and v1 readiness with one real end-to-end topic, live literature, direction/refusal decision, experiment protocol, artifact package, manuscript/rebuttal planning, migration audit, CLI/docs cleanup, artifact hygiene, external feedback, and explicit v1 gate
- Future: richer layout/OCR extraction, external reference-manager integration, larger distributed experiment runners, and collaborative review workflows