Releases: PattersonResearch/Penrose
Penrose v0.3.0
[0.3.0] — 2026-06-27
A robustness, agent-surface, and data release. Post-0.2.0 work, much of it surfaced by a fresh-clone
audit and by refereeing an external code-complete framework, then adversarially swarm-audited.
Added
- Tail-risk / widow-maker gate (default-off). Every backtest now reports tail diagnostics (skew,
CVaR-5/95, tail ratio, max loss vs gain, worst-vs-typical). An opt-in `TAIL_RISK_GATE` kills (or caps
at `watch`) a stable, well-deflated strategy whose payoff is bounded-up / unbounded-down (negative
skew, fat left tail): the short-vol / positive-carry blind spot the other gates miss. Default-off, so
no existing verdict moves; `tail_asymmetric` is a structural kill for principle formation.
- Contrastive principles. A second distiller learns from the survivor-vs-kill boundary: when a
structural failure mode recurs in one domain but other domains yield survivors, it proposes an advisory
contrastive principle (e.g. "regime_fragile is specific to trend-following; carry survives it").
Additive (recurrence principles unchanged); surfaced via `views.principles()` and the read-only MCP.
- Point-in-time futures data adapter (`pysystemtrade`). A fail-open BYO local vendor that reads
pysystemtrade adjusted-price CSVs, always resamples intraday→daily through the granularity gate before
the data can reach verdict logic, and tags provenance back-adjusted + resampled. Instrument names are
restricted to safe characters (no path traversal). Inactive/harmless when no futures dir is configured.
- Agent-readable principle surface. `views.principles()` and `views.proposals()` expose the
distilled principle candidates and the propose-only store as structured read-only data, so an agent
can pull and discuss "what candidates exist" without the dashboard. The read-only MCP routes its
`penrose_principles` / `penrose_proposals` tools through these accessors (one read path, no drift);
promotion to the approved brain still requires human P9.
- `trend-following` domain in cross-run principle inference, so trend / EWMAC claims cluster as
`trend-following` instead of falling through to `other`.
- Data-granularity verification (`penrose.data.granularity`). Infers a series' empirical sampling
frequency from its index and flags a mismatch with the expected frequency (e.g. intraday bars where a
rule assumes daily, which silently corrupts every downstream statistic). The input-side analogue of
the existing output `bars_per_year`-vs-span check. `DataBundle.granularity_warnings()` surfaces it;
advisory and fail-open by default (no verdict change).
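The granularity check described above can be sketched as follows. This is a minimal illustration using only the standard library, not the `penrose.data.granularity` implementation: the function names `infer_frequency` and `check_granularity`, the frequency labels, and the thresholds are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical labels and thresholds, chosen for illustration only.
def infer_frequency(timestamps):
    """Classify the typical spacing between sorted timestamps."""
    if len(timestamps) < 2:
        return "unknown"
    gaps = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    typical = median(gaps)  # the median ignores occasional weekend/holiday gaps
    if typical < 23 * 3600:
        return "intraday"
    if typical <= 4 * 86400:
        return "daily"
    return "lower-than-daily"

def check_granularity(timestamps, expected="daily"):
    """Advisory only: return a list of warnings, never raise or change a verdict."""
    observed = infer_frequency(timestamps)
    if observed in ("unknown", expected):
        return []  # fail-open: nothing to report
    return [f"expected {expected} bars but the index looks {observed}"]

start = datetime(2026, 1, 5)
hourly = [start + timedelta(hours=i) for i in range(48)]
daily = [start + timedelta(days=i) for i in range(30)]
print(check_granularity(hourly))  # one warning: intraday bars under a daily rule
print(check_granularity(daily))   # empty list
```

Returning a list of warnings rather than raising mirrors the advisory, fail-open behavior the note describes: a mismatch is surfaced but the pipeline continues.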
Fixed
- Trusted operator modules now ship in every clone. The public `.gitignore` (`modules/*`) was
dropping the reviewed `crypto_funding_carry` and `macro_vol_btc` modules from published clones, so a
fresh clone failed the `PROVENANCE-SHELF` eval invariant (92/93) even though the README documents both.
The two trusted modules now ship (`generated_auto` modules stay ignored); a cold clone passes 93/93.
- Graceful capacity on low-turnover strategies. `capacity_ci` raised `OverflowError` converting an
infinite modeled capacity (a strategy that barely trades drives turnover toward zero) into an integer,
crashing the entire backtest. It now drops non-finite resamples and reports capacity as undefined,
consistent with the fail-visibly contract. Regression-tested.
- Public test bar. A `test_cli` check read `Makefile.public`, which the public build renames to
`Makefile`, so the shipped test failed in the distribution it ships to. It now reads whichever exists;
the public pytest bar is a clean 137 passed / 2 skipped.
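The capacity fix above addresses a general Python behavior: `int()` raises `OverflowError` on an infinite float. The sketch below reproduces that failure and shows one way to drop non-finite resamples and report "undefined" instead. The function `capacity_point_estimate` and its return shape are hypothetical; the actual `capacity_ci` signature is not shown in these notes.

```python
import math

def capacity_point_estimate(resamples):
    """Summarise bootstrap capacity resamples; None means 'undefined'."""
    finite = [x for x in resamples if math.isfinite(x)]
    dropped = len(resamples) - len(finite)
    if not finite:
        return None, dropped  # visible 'undefined', not a crash
    finite.sort()
    mid = finite[len(finite) // 2]  # upper median keeps the sketch short
    return int(mid), dropped

# The failure mode itself: int() cannot represent infinity.
try:
    int(float("inf"))
except OverflowError as exc:
    print("naive conversion fails:", exc)

print(capacity_point_estimate([2.5e6, float("inf"), 1.0e6, float("nan")]))
print(capacity_point_estimate([float("inf"), float("inf")]))
```

Reporting the number of dropped resamples alongside the estimate keeps the degradation visible to the caller.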
Docs
- Quickstart uses the real clone URL and surfaces the process-conditional worked example as the
recommended first reproduction; eval count corrected to 93/93 in AGENTS.md / CLAUDE.md.
- Companion-paper bibliographies verified against publisher / arXiv records.
Penrose v0.2.0
[0.2.0] — 2026-06-25
A correctness-and-coverage release. Every change was implemented and adversarially swarm-audited;
the evaluation-invariant suite, the calibration battery (null + placebo + injection), and the unit
tests are green. The headline is two verdict-lane correctness fixes plus a real data unblock.
Verdict integrity
- Order-independent deflation denominator (5c). The Deflated Sharpe multiple-testing count is now
pre-registered as a per-family cohort before evaluation, instead of a running tally read at backtest
time. Previously the same strategy could get a different verdict depending on whether it ran 1st or
8th in its family (early members were under-deflated). Now every member deflates by the full family
size, uniformly and race-free. This can only tighten verdicts (it closes a selection-bias hole); no
existing eval outcome moved.
- Module generation learns to be faithful (6c). Claims are routed by type
(descriptive-statistical / trading-strategy / structural-proposition) so a descriptive claim (e.g. an
unconditional mean) is implemented as a statistic test, not a trading backtest. A pre-backtest
fidelity gate flags unfaithful specs before the expensive run, and a fidelity-rejection memory feeds
past divergences back into generation. Fidelity only ever demotes or blocks, never promotes.
- Regime-scope declaration. A claim can pre-register a declared regime and be tested fairly within
it (adherence-gated), instead of being falsely killed as regime-fragile for concentrating where it
intends to trade.
- CPCV / overfitting kill-lens. Combinatorial purged cross-validation (Lopez de Prado) added as an
independent robustness axis next to the bootstrap, permutation, and walk-forward gates.
- Actionable `underpowered` verdicts. A verdict that can't resolve a realistic edge now reports how
much more would resolve it: the marginal OOS trades still needed (or the cross-sectional breadth
alternative), turning a dead-end label into a sequential next step.
- Independent fidelity verifier (optional). The fidelity refuter can route to a genuinely
independent second LLM provider (configurable via `PENROSE_LLM_VERIFIER_*`), reducing the correlated
blind spots of a model checking its own work; it falls back to the same provider by default, and each
result records whether the check was independent.
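To see why the order-independent deflation denominator (5c) matters, the sketch below compares a running tally with a pre-registered cohort. It assumes the hurdle is the Bailey and Lopez de Prado approximation for the expected maximum Sharpe among N skill-less trials; whether Penrose uses exactly this form is not stated in these notes, and the family names are hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329

def expected_max_sharpe(n_trials, sharpe_std=1.0):
    """Approximate expected maximum Sharpe among n_trials skill-less strategies."""
    if n_trials < 2:
        return 0.0  # a single trial carries no multiple-testing penalty
    z = NormalDist().inv_cdf
    return sharpe_std * (
        (1 - EULER_GAMMA) * z(1 - 1 / n_trials)
        + EULER_GAMMA * z(1 - 1 / (n_trials * math.e))
    )

family = [f"trend_{i}" for i in range(1, 9)]  # hypothetical 8-member family

# Running tally: the hurdle depends on how many members ran before this one.
running = {name: expected_max_sharpe(i) for i, name in enumerate(family, start=1)}

# Pre-registered cohort: every member deflates by the full family size.
cohort = {name: expected_max_sharpe(len(family)) for name in family}

print(round(running["trend_1"], 3), round(running["trend_8"], 3))
print(sorted({round(v, 3) for v in cohort.values()}))
```

Under the running tally the first member faces a lower hurdle than the eighth; under the cohort every member faces the same hurdle, which is never lower than its running-tally value. That is the sense in which the change can only tighten verdicts.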
Data ("works out of the box" for more than crypto)
- Catalog-derived domain awareness. The relevance gate and spec generator read the data catalog at
runtime, so adding a new-domain series (equities, rates, inflation, commodities) makes those theses
testable and lets the generator request real series names instead of inventing them. Fail-open to the
built-in behavior when no catalog is present.
- Keyless long-history adapter (Stooq). A 6th out-of-the-box data adapter: decades of daily
equity/index data with no API key, filling the gap where the free Alpha Vantage tier (~100 bars)
flipped equity theses to `insufficient_data`.
- Conservative name-resolution. Near-miss series names resolve only on a unique high-confidence
match; ambiguous names miss (never a wrong-series resolution).
- Auto-fetch the `needs_data` loop. When a claim needs a series an enabled vendor can supply
unambiguously, Penrose fetches it once and re-tests, instead of only logging the request. Bounded and
conservative (never supplies a wrong/ambiguous series).
- Panel adapter. A `panel` catalog adapter type for resolution-outcome / microstructure data
(daily event-date aggregation), the framework for the largest class of data-blocked theses.
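The conservative name-resolution rule above (resolve only on a unique high-confidence match, otherwise miss) can be illustrated with the standard library's `difflib`. This is a sketch under assumptions: the function `resolve_series`, the 0.9 cutoff, and the catalog names are invented for the example and do not describe Penrose's actual matcher.

```python
from difflib import get_close_matches

def resolve_series(requested, catalog, cutoff=0.9):
    """Return a catalog name only for an exact or unique near-miss match."""
    if requested in catalog:
        return requested
    # Ask for up to two candidates so ambiguity can be detected.
    candidates = get_close_matches(requested, catalog, n=2, cutoff=cutoff)
    return candidates[0] if len(candidates) == 1 else None

# Hypothetical catalog entries.
catalog = ["spx500_daily", "us_rate_2y", "us_rate_5y", "btc_usd_daily"]
print(resolve_series("sp500_daily", catalog))  # unique near miss
print(resolve_series("us_rate_y", catalog))    # two equally close names
print(resolve_series("gold_spot", catalog))    # nothing close enough
```

With this catalog the first request resolves to `spx500_daily`, while the other two return `None`: a miss is preferred over resolving to the wrong series.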
Learning surface (P9 firewall intact)
- Cross-run principle distillation. Structural-kill principles are now distilled across the whole
decision corpus, not just within a single run, so recurring failure modes actually surface.
- Propose-only read store. An agent-readable record of "what Penrose has learned"
(`status: proposed`), strictly separate from the approved brain; promotion still requires human P9 approval.
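A rough sketch of cross-run distillation follows. It assumes a flat list of decision records with hypothetical fields (`run_id`, `verdict`, `reason`, `structural`) and a hypothetical recurrence threshold `min_runs`; the real corpus schema is not described in these notes. Every output is marked `proposed`, never approved.

```python
from collections import defaultdict

def distill_proposals(decisions, min_runs=2):
    """Propose a principle for each structural kill reason seen in >= min_runs runs."""
    runs_by_reason = defaultdict(set)
    for d in decisions:
        if d["verdict"] == "kill" and d.get("structural"):
            runs_by_reason[d["reason"]].add(d["run_id"])
    return [
        {"reason": reason, "runs": sorted(runs), "status": "proposed"}
        for reason, runs in sorted(runs_by_reason.items())
        if len(runs) >= min_runs
    ]

# Hypothetical decision records spanning three runs.
decisions = [
    {"run_id": "r1", "verdict": "kill", "reason": "regime_fragile", "structural": True},
    {"run_id": "r2", "verdict": "kill", "reason": "regime_fragile", "structural": True},
    {"run_id": "r2", "verdict": "kill", "reason": "tail_asymmetric", "structural": True},
    {"run_id": "r3", "verdict": "watch", "reason": "", "structural": False},
]
print(distill_proposals(decisions))
```

No failure mode repeats inside any single run here, so a per-run distiller would propose nothing; pooling runs is what lets `regime_fragile` surface.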
Robustness & honesty
- Output directories are created on startup (fresh/CI/sandboxed clones no longer fail on a missing
`reports/`).
- A fidelity-refuter network timeout degrades to "fidelity unknown" and continues, instead of killing
the run; LLM timeouts are configurable per role; optional `--max-claims`.
- Re-running an unchanged source is idempotent (atomic supersede by source identity), instead of
appending duplicate decisions.
- Strategy-class alias collisions no longer log spurious warnings.
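One way to get the idempotent re-run behavior described above is to key decisions by a source identity and swap the store file atomically. The sketch below assumes a JSON file store and a content hash as the identity; both are illustrative choices, and `source_identity` and `record_decision` are hypothetical names rather than Penrose functions.

```python
import hashlib
import json
import os
import tempfile

def source_identity(source_text):
    """Assumed identity: a SHA-256 hash of the source content."""
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()

def record_decision(store_path, source_text, decision):
    """Supersede any earlier decision for the same source, then swap the file atomically."""
    store = {}
    if os.path.exists(store_path):
        with open(store_path, encoding="utf-8") as fh:
            store = json.load(fh)
    store[source_identity(source_text)] = decision  # overwrite, never append
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(store_path) or ".")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(store, fh)
    os.replace(tmp_path, store_path)  # atomic rename on the same filesystem
    return store

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "decisions.json")
    record_decision(path, "claim text", {"verdict": "watch"})
    store = record_decision(path, "claim text", {"verdict": "watch"})
    print(len(store))  # still one decision after the second run
```

Writing to a temporary file in the same directory and renaming it means a reader sees either the old store or the new one, never a partial write.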
Agentic surface & tooling
- Read-only MCP server (`pip install penrose[mcp]`, `penrose-mcp`). Five read-only tools let an
agent query verdicts, proposed principles, open data-requests, and pipeline status over the Model
Context Protocol. It exposes operations, not escape hatches: nothing over MCP can approve a verdict
(P9 stays human), write the corpus, or run anything. `mcp` is optional; the core never requires it.
- `penrose run --json` emits a single machine-readable result object (verdicts + principle), so callers
no longer have to tail a log; `penrose run --claims <claims.json>` injects pre-built structured claims
and bypasses the lossy P2 re-extraction round-trip.
- Make targets honor a `PY` override (e.g. `make eval PY=./.venv/bin/python`); `.PHONY` completed;
stress-testing docs linked.
- The public-build pipeline is hardened: tracked-files-only staging (gitignored operator artifacts can
never ship), fund-specific leak markers, a symlink guard, and a dry-run-by-default sync.
Penrose v0.1.0
Penrose is an independent, power-aware falsification referee for quantitative trading claims. This is the first public release, a research prototype rather than a finished product.
What's in v0.1.0
- The full falsification pipeline: ingestion, grounded claim extraction, sandboxed reconstruction, a robustness and power gate battery, and a calibrated verdict (`kill/underpowered/watch/research-supported`).
- The statistical core: a Deflated Sharpe Ratio scoped to the size of the search seen, three-fold sign stability, a regime kill-lens, a bootstrap edge interval, a permutation test, walk-forward consistency, cost and capacity modeling, and a single-use per-claim locked holdout gated on significance.
- A self-calibrated detector: placebo, injected-edge, native-breadth, dead-state, and persistence-matched controls plus a multi-null battery (eval 82/82; clone-and-go reproduces it with no key and no network).
- Five working data adapters out of the box: Coinbase, Kraken, and Deribit (keyless live), plus FRED and Alpha Vantage (free key), over a bring-your-own data contract.
- Pennie, the corpus-grounded chat assistant, and the corpus of invalidations.
- Plain-language gate documentation (`docs/GATES.md`), agent onboarding (`AGENTS.md`), and two companion papers (systems + evidence standard).
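As an illustration of one item in the statistical core, the sketch below shows a sign-flip permutation test on a return series. This is one common form of permutation test and is offered as an assumption; the notes do not say which variant Penrose implements, and `sign_flip_p_value` and the sample series are hypothetical.

```python
import random
from statistics import fmean

def sign_flip_p_value(returns, n_perm=999, seed=0):
    """One-sided p-value for mean(returns) > 0 under random sign flips."""
    rng = random.Random(seed)
    observed = fmean(returns)
    hits = 0
    for _ in range(n_perm):
        flipped = [r if rng.random() < 0.5 else -r for r in returns]
        if fmean(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one keeps the estimate above zero

edge = [0.01 + 0.001 * (i % 5) for i in range(40)]          # consistently positive
noise = [0.01 if i % 2 == 0 else -0.01 for i in range(40)]  # mean of zero
print(sign_flip_p_value(edge))
print(sign_flip_p_value(noise))
```

The consistently positive series yields a small p-value, while the zero-mean series does not. The null hypothesis here is that each return's sign is arbitrary, which ignores serial dependence; a production gate would need to account for that.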
Pre-1.0 expectations
This is a 0.x release: interfaces may change, costs and capacity are modeled rather than measured, and independent replication remains future work. The headline guarantees are described honestly: deflation scales with the search seen, and the holdout is single-use and conservative. Please try to break it and open an issue.