Skip to content

Releases: Archerkattri/mathlas

v1.4.0: MCP tool annotations + sandbox hardening

11 Jun 15:54

Choose a tag to compare

What's new in v1.4.0

Since v1.3.0:

  • MCP ToolAnnotations on every tool — all 12 tools now declare behaviour hints (readOnlyHint / idempotentHint / openWorldHint / destructiveHint) on both the FastMCP and the dependency-free fallback server. Clients can auto-approve the 10 read-only tools and gate the two state-mutating ones (add_finding, funsearch). Annotations degrade gracefully on older SDKs.
  • funsearch OS-level sandbox — replaced the socket monkey-patch with a real OS-level sandbox for the program-search harness.
  • Corrected Zenodo concept DOI.

Install: claude mcp add mathlas -- uvx mathlas-mcp · PyPI: https://pypi.org/project/mathlas-mcp/1.4.0/

mathlas v1.3.0: the 0.6B end-to-end laptop tier

10 Jun 23:53

Choose a tag to compare

The 0.6B end-to-end laptop tier

The same 3,683,428-document corpus and served representation channel, re-embedded once with Qwen3-Embedding-0.6B (1024-d, row-aligned with the served meta), so the query encoder itself runs on a laptop CPU. This closes the honest caveat of the v1.1 quantized tier, whose queries still needed 8B-space vectors from a GPU box.

Serve it with one env var (composes with the quantized sidecars):

MATHLAS_ENCODER=0.6b MATHLAS_QUANTIZED=binary python -m mathlas.server

Measured (n=3000 body-to-slogan self-recall, full 3.68M index)

dense config R@1 R@10 end-to-end warm
0.6B fp16 exact (tier baseline) 0.544 0.745 -
0.6B binary top-1000 + int8 rescore 0.545 0.745 0.67 s/query (4 CPU threads), 0.88 s (2)
8B quantized tier (reference) 0.614 0.832 2.4 s/query, search only (encoder needs a GPU box)
  • Quantization is again recall-lossless within the tier; raw 1024-bit Hamming alone loses 4.5pp R@1, so the rescore stage is load-bearing at this dimension.
  • The honest price of CPU-sized query encoding is about 7-9pp recall vs the 8B tier. The dual-channel 8B configuration (0.965 / 0.999) remains the big-box quality ceiling.
  • The latency number is the first in the ladder to include query encoding, because this is the first tier whose encoder (~1.2 GB) fits on the target machine.
  • Dense-channel footprint: 0.47 GB binary sidecar + ~1.2 GB encoder (~1.7 GB), with a 3.77 GB int8 rescore source read per query, not resident.
  • TheoremSearch-110 corpus-only probe: Hit@20 8.2% / 10.0% theorem/paper vs the 8B tier's 10.0% / 11.8% (both licensing-bounded floors).

Full tables, protocol, and caveats: docs/QUANTIZED_TIER.md and RESULTS.md. Build: scripts/build_06b_index.py; eval: scripts/eval_06b_tier.py; serving mechanism pinned by tests/test_encoder_tier.py.

Gate: pytest 118 passed / 1 skipped.

v1.2.0: dual-channel retrieval, proof checking, quantized tier, source-aware search

10 Jun 21:40

Choose a tag to compare

mathlas v1.2.0: the retrieval-upgrades release. Every number below is measured on the served 3,683,428-document index (Qwen3-Embedding-8B, 4096-d; provenance: 1,341,083 permissive CC-BY/CC0 TheoremSearch-subset docs + 2,342,345 slogan-embedded Dolma arXiv-math docs; index finalized 2026-06, all evals re-run 2026-06-10) and is reproducible from the scripts named in RESULTS.md.

Dual-channel retrieval (new, opt-in)

A second dense channel: the same 3.68M docs embedded by their cleaned LaTeX statement (row-aligned with the served slogan matrix; built by scripts/build_statement_channel.py), folded into the dense ranking by per-doc max-sim.

  • Full-corpus self-recall (n=3000, same cached queries as the published baseline): R@1 0.614 -> 0.965, R@10 0.832 -> 0.999.
  • Honest caveat, stated up front: that eval is a self-retrieval proxy in which the statement channel indexes the very text the queries are drawn from, so it carries an exact-text advantage (like BM25). On the no-leak TheoremSearch-110 human-query benchmark the lift is real but partial: paper Hit@20 11.8% -> 12.7%, theorem 10.0% -> 10.9%. The index-growth regression (13.6% -> 11.8% after adding the 2.34M Dolma docs) is therefore NOT fully repaired structurally; the opt-in source-aware exclude knob remains the documented full mitigation (13.6%, 15/15 reachable).
  • Serving: strictly opt-in via MATHLAS_STATEMENT_INDEX (never auto-detected). Measured at full scale: 150 GB process peak (vs ~95 GB single-channel), 264 s load, ~2.75 s/query dual dense scan on 2 CPU threads, 20/20 top-10 on real cached query vectors through the genuine retrieve() path. Serving tiers, honestly: binary sidecar 1.9 GB / int8 sidecar 15.1 GB / fp16 single channel ~57.5 GB matrix RAM (the default) / fp16 dual channel ~115 GB matrix RAM (plan for a ~192 GB box). The dual tier is not combinable with the quantized tier.
  • Shipping this exposed a real loader bug, fixed in this release: from_index used to materialize fp16 + fp32 + normalization temps per matrix, which transiently exceeded 250 GB and OOM-killed the dual load on the 251 GB build box. Matrices are now streamed memmap -> chunked unit-norm fp32, so peak load memory equals the resident footprint (verified at full scale, regression-tested). This also cuts the default single-channel load peak by ~30 GB.

Retrieval upgrades shipped alongside

  • Hybrid RRF default rrf_k 60 -> 10: wins every metric on the n=3000 sample (R@1 0.764 -> 0.771, R@10 0.991 -> 0.999).
  • Opt-in cross-encoder rerank (MATHLAS_RERANK=1, Qwen3-Reranker-0.6B): RRF-blended with the first-stage order, never a replacement (replacement measurably HURTS R@1, and we say so); honest +1.7pp R@1 / +3.9pp R@10 cross-representation lift over dense.
  • Source-aware retrieval: opt-in source_filter / source_weights on search_existing_math (canonical keys arxiv / dolma / stacks / proofwiki / other). Excluding dolma recovers the pre-growth paper-level 13.6% on the 110-query bench; the default ranking stays byte-identical (test-pinned), because on the n=3000 self-recall down-weighting dolma is catastrophic for dolma-target queries.

Verification and serving features

  • Proof checking, the repair loop: verify_formal(statement, proof=...) kernel-checks a full Lean 4 proof with the real Lean 4.30 kernel; REFUTED returns the kernel's error verbatim; sorry/admit are REJECTED (Lean exits 0 on sorried proofs, so the naive check would fake-pass); toolchain-absent/timeout return honest UNDETERMINED. 20 pinned tests.
  • Formal-search cache: search_formal_math keeps a 7-day on-disk cache and serves the last good Loogle/LeanSearch response when the service is down, clearly labeled with its age.
  • Quantized laptop tier: MATHLAS_QUANTIZED=binary|int8 serves the same 3.68M dense channel from memmapped sidecars (1.9 GB binary Hamming + exact rescore / 15.1 GB int8). Measured recall-lossless vs fp16 (R@1 0.6143 vs 0.6140, R@10 equal) at 2.4 s/query on 4 CPU threads. Caveat: queries must still be embedded by the index's own 8B encoder.
  • Hardened agent benchmark: 18 tasks (10 original, now saturated by frontier models and reported as such, + 8 verification-bottleneck hard tasks). Claude Fable 5 WITH mathlas 18/18 vs WITHOUT 15/18; the whole gap is the hard set (8/8 vs 5/8).

Correctness fixes (Codex review)

  • HashingEmbedder now uses deterministic blake2b instead of Python's salted hash(): a persisted hashing index built in one process is no longer silently broken when queried from another. The flagship Qwen3 index never touched this path.
  • retrieve(mode=...) validates the mode and raises a clear ValueError instead of returning silently empty results on a typo.
  • mathlas.__version__ now reads the installed package metadata (single source of truth), with a test pinning it to pyproject.

Gate

  • pytest: 111 passed, 1 skipped (opt-in network test), 0 failed.
  • All §1 tiers re-run green: numeric 8/8 (0/3 false positives), sequence 8/8, formal 7/7, ramanujan 6/6 (0/2 fp), moat 15/15 + 6/6, tools 14/14.

Full numbers and caveats: README.md, RESULTS.md, docs/RETRIEVAL_UPGRADE_NOTES.md, docs/02_eval_vs_theoremsearch.md, docs/QUANTIZED_TIER.md.

Archive

This release is archived on Zenodo: version DOI 10.5281/zenodo.20633923 (concept DOI 10.5281/zenodo.20618603, as cited in CITATION.cff).

v1.1.2 — Apache-2.0 relicense

10 Jun 00:30

Choose a tag to compare

The mathlas code is now Apache-2.0 (was PolyForm-NC). Published corpus/index artifacts keep their own per-source terms (CC-BY/CC0). Also: official-registry namespace case fix and Zenodo DOI landed in 1.1.1.

pip install -U mathlas-mcp

v1.1.0 — formal search · reranker blend · dual-channel retrieval · 12-tool schema pass

09 Jun 23:50

Choose a tag to compare

mathlas-mcp 1.1.0 — the verification-first release.

New

  • search_formal_math: Loogle + LeanSearch proxy returning mathlib declarations with provenance — live eval Hit@1/5/10 = 0.64/0.96/0.96 over 25 index-verified queries.
  • Qwen3-Reranker-0.6B blend (opt-in): dense R@1 0.731→0.748, R@5 0.933→0.974. (Honest finding: replacement reranking hurts — relevance ≠ row identity — so it ships as an RRF blend.)
  • Dual-channel retrieval (opt-in): per-doc max-sim over slogan + statement embeddings; 200k-slice validation R@1 0.833→0.981. Full-corpus statement channel building.
  • uvx mathlas-mcp one-liner; official-registry server.json; 12-tool surface with outputSchema/title (funsearch collapsed to one action-enum tool).

Changed

  • Index now 3,683,428 docs (1.34M + 2.34M slogan-embedded Dolma); honest cross-representation headline R@1 0.614 / R@10 0.832.
  • RRF k default 60→10 (evaluated); BM25 top-k via argpartition (9× sort speedup, deterministic ties).
  • README positioning rewritten: verification tiers + applicability checklist + the self-augmenting loop are the moat; comparison table vs TheoremSearch / LeanSearch / Wolfram MCP / sympy-mcp.

pip install -U mathlas-mcp · claude mcp add mathlas -- uvx mathlas-mcp

mathlas-mcp 1.0.1

07 Jun 16:38

Choose a tag to compare

Production release. Free, no-LLM, airtight-math MCP tool an AI uses — 1.635M-doc retrieval index, OEIS/Lean/PSLQ verification, self-augmenting web loop. pip install mathlas-mcp