Skip to content

Bonorinoa/leanecon_v3

Repository files navigation

Lean Econ v3

Lean Econ v3 is a clean-slate thin harness for research-grade economic-theory formalization and verification in Lean 4.

The preserved moat is lean_workspace/ plus the REPL/guardrail/observability spine ported from v2. The prover is claim-type-aware: preamble-definable claims use LeanEcon metadata and direct closure (zero tool calls), while mathlib-native claims enter mathlib_native_mode and use lean-lsp-mcp goal, diagnostic, code-action, hover, LeanSearch, and Loogle tooling.

Mathlib-native goals run through a harness-owned hybrid retrieval pipeline (Sprints 21–23): a local RAG over a ~1480-entry seed of Mathlib Topology / Order / Analysis / FixedPoints premises (semantic embedding via sentence-transformers/all-MiniLM-L6-v2, with a hashing fallback) plus first-class lean_leansearch results enriched with full type signatures and docstrings via cached lean_file_outline / lean_hover_info. Stalled turns trigger a refined second-pass query; hallucinated identifiers (unknown identifier) trigger a rescue retrieval keyed on the parsed concept tokens. Every retrieval, every failure, and every recovery attempt is observable in the per-run JSONL trace (Sprint 24).

Repository evidence is authoritative. Historical benchmark JSON under benchmark_baselines/v3_alpha/ is preserved for audit history and should not be treated as public release truth.

The current hosted benchmark-default stack is:

  • Planner: mistral-structured with mistral-large-2512
  • Formalizer: leanstral via Mistral
  • Prover: leanstral via Mistral, with bounded Lean REPL + lean-lsp-mcp tool use

Quick Start

python -m pip install -e ".[dev]"
cd lean_workspace && lake build
cd ..
uvicorn src.api:app --host 0.0.0.0 --port 8000

For mathlib-native proving, make sure lean-lsp-mcp is available either in .venv/bin/lean-lsp-mcp or through uvx lean-lsp-mcp. The client runs it over stdio with the Lean workspace path.

Repo Highlights

  • docs/CHARTER_v3.md: mission, assets, liabilities, and alpha success metrics.
  • docs/ARCHITECTURE_v3.md: authoritative v3 runtime architecture.
  • docs/LeanEcon Engineering Log.md: sprint-by-sprint implementation history through Session 24 (hybrid retrieval, synthesis lift, observable failure paths).
  • docs/RAILWAY_DEPLOYMENT_CHECKLIST.md: minimum deployment bar before any Railway readiness claim.
  • skills/*.md: runtime process knowledge for HILBERT, APOLLO, guardrails, and the preamble model.
  • evals/claim_sets/: canonical benchmark inputs plus the v3 alpha PhD-qual additions. Historical mixed sets live under evals/claim_sets/archive/, and regression-only utilities live under evals/claim_sets/regressions/.
  • evals/benchmark_manifest.json: claim-set bucket mix, theorem-stub usage, and integrity status.

Benchmark Note

python -m evals.local_gate now defaults to tier0_smoke, tier1_core_preamble_definable, tier2_frontier_mathlib_native, and tier2_frontier_preamble_definable. Use --benchmark-mode to force the full Planner -> Formalizer -> Prover path with benchmark guardrails, periodic heartbeat lines during long claims, explicit verified_via tagging, and claim-type-aware trace metadata.

For integrity-sensitive work, prefer the canonical split sets tier1_core_preamble_definable, tier2_frontier_mathlib_native, and tier2_frontier_preamble_definable. The old mixed sets are archived under evals/claim_sets/archive/ and are no longer part of the canonical benchmark surface.

The benchmark flywheel records average tool calls, LSP tool calls, native search attempts, mathlib_native_mode usage, direct-close rate, decomposition depth, latency, cost, failure classes, claim-type mix, retrieval-side metrics (retrieval_hit_rate@5, leansearch_hit_rate@5, enriched_leansearch_hit_rate, second_retrieval_rate, hybrid_retrieval_latency_ms), and synthesis-side metrics (synthesis_efficiency, premise_match_rate@3, avg_decomposition_depth_mathlib) in benchmark_baselines/v3_alpha/benchmark_history.jsonl.

Current state (Sprint 25 implementation checkpoint): 86.5% benchmark-mode pass rate (32/37) on the canonical Sprint 24 claim sets; tier1_core_preamble_definable at 24/24 with 0 tool calls; tier2_frontier_mathlib_native was stuck at 1/3 across Sprints 20–24 despite a fully-instrumented hybrid retrieval pipeline. Sprint 25 adds deterministic proof-sketch injection, generic few-shot tactic patterns, SynthesisEvent metrics, and verified helper-lemma memory. Baseline promotion remains gated on the next canonical benchmark run reaching the Sprint 25 pass/regression thresholds.

Deployment Path

Deployment readiness is currently defined by a clean lake build, focused Python tests, benchmark-mode local-gate, /health, /metrics, configured Mistral/Leanstral credentials, SQLite job storage, and runtime access to lean-lsp-mcp. Railway deployment should follow docs/RAILWAY_DEPLOYMENT_CHECKLIST.md and the updated architecture doc.

About

**A research-grade agentic formalizer and verifier for economic theory.**

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors