Releases: Archerkattri/mathlas
v1.4.0: MCP tool annotations + sandbox hardening
What's new in v1.4.0
Since v1.3.0:
- MCP ToolAnnotations on every tool — all 12 tools now declare behaviour hints (
readOnlyHint/idempotentHint/openWorldHint/destructiveHint) on both the FastMCP and the dependency-free fallback server. Clients can auto-approve the 10 read-only tools and gate the two state-mutating ones (add_finding,funsearch). Annotations degrade gracefully on older SDKs. - funsearch OS-level sandbox — replaced the socket monkey-patch with a real OS-level sandbox for the program-search harness.
- Corrected Zenodo concept DOI.
Install: claude mcp add mathlas -- uvx mathlas-mcp · PyPI: https://pypi.org/project/mathlas-mcp/1.4.0/
mathlas v1.3.0: the 0.6B end-to-end laptop tier
The 0.6B end-to-end laptop tier
The same 3,683,428-document corpus and served representation channel, re-embedded once with Qwen3-Embedding-0.6B (1024-d, row-aligned with the served meta), so the query encoder itself runs on a laptop CPU. This closes the honest caveat of the v1.1 quantized tier, whose queries still needed 8B-space vectors from a GPU box.
Serve it with one env var (composes with the quantized sidecars):
MATHLAS_ENCODER=0.6b MATHLAS_QUANTIZED=binary python -m mathlas.serverMeasured (n=3000 body-to-slogan self-recall, full 3.68M index)
| dense config | R@1 | R@10 | end-to-end warm |
|---|---|---|---|
| 0.6B fp16 exact (tier baseline) | 0.544 | 0.745 | - |
| 0.6B binary top-1000 + int8 rescore | 0.545 | 0.745 | 0.67 s/query (4 CPU threads), 0.88 s (2) |
| 8B quantized tier (reference) | 0.614 | 0.832 | 2.4 s/query, search only (encoder needs a GPU box) |
- Quantization is again recall-lossless within the tier; raw 1024-bit Hamming alone loses 4.5pp R@1, so the rescore stage is load-bearing at this dimension.
- The honest price of CPU-sized query encoding is about 7-9pp recall vs the 8B tier. The dual-channel 8B configuration (0.965 / 0.999) remains the big-box quality ceiling.
- The latency number is the first in the ladder to include query encoding, because this is the first tier whose encoder (~1.2 GB) fits on the target machine.
- Dense-channel footprint: 0.47 GB binary sidecar + ~1.2 GB encoder (~1.7 GB), with a 3.77 GB int8 rescore source read per query, not resident.
- TheoremSearch-110 corpus-only probe: Hit@20 8.2% / 10.0% theorem/paper vs the 8B tier's 10.0% / 11.8% (both licensing-bounded floors).
Full tables, protocol, and caveats: docs/QUANTIZED_TIER.md and RESULTS.md. Build: scripts/build_06b_index.py; eval: scripts/eval_06b_tier.py; serving mechanism pinned by tests/test_encoder_tier.py.
Gate: pytest 118 passed / 1 skipped.
v1.2.0: dual-channel retrieval, proof checking, quantized tier, source-aware search
mathlas v1.2.0: the retrieval-upgrades release. Every number below is measured on the served 3,683,428-document index (Qwen3-Embedding-8B, 4096-d; provenance: 1,341,083 permissive CC-BY/CC0 TheoremSearch-subset docs + 2,342,345 slogan-embedded Dolma arXiv-math docs; index finalized 2026-06, all evals re-run 2026-06-10) and is reproducible from the scripts named in RESULTS.md.
Dual-channel retrieval (new, opt-in)
A second dense channel: the same 3.68M docs embedded by their cleaned LaTeX statement (row-aligned with the served slogan matrix; built by scripts/build_statement_channel.py), folded into the dense ranking by per-doc max-sim.
- Full-corpus self-recall (n=3000, same cached queries as the published baseline): R@1 0.614 -> 0.965, R@10 0.832 -> 0.999.
- Honest caveat, stated up front: that eval is a self-retrieval proxy in which the statement channel indexes the very text the queries are drawn from, so it carries an exact-text advantage (like BM25). On the no-leak TheoremSearch-110 human-query benchmark the lift is real but partial: paper Hit@20 11.8% -> 12.7%, theorem 10.0% -> 10.9%. The index-growth regression (13.6% -> 11.8% after adding the 2.34M Dolma docs) is therefore NOT fully repaired structurally; the opt-in source-aware exclude knob remains the documented full mitigation (13.6%, 15/15 reachable).
- Serving: strictly opt-in via
MATHLAS_STATEMENT_INDEX(never auto-detected). Measured at full scale: 150 GB process peak (vs ~95 GB single-channel), 264 s load, ~2.75 s/query dual dense scan on 2 CPU threads, 20/20 top-10 on real cached query vectors through the genuine retrieve() path. Serving tiers, honestly: binary sidecar 1.9 GB / int8 sidecar 15.1 GB / fp16 single channel ~57.5 GB matrix RAM (the default) / fp16 dual channel ~115 GB matrix RAM (plan for a ~192 GB box). The dual tier is not combinable with the quantized tier. - Shipping this exposed a real loader bug, fixed in this release:
from_indexused to materialize fp16 + fp32 + normalization temps per matrix, which transiently exceeded 250 GB and OOM-killed the dual load on the 251 GB build box. Matrices are now streamed memmap -> chunked unit-norm fp32, so peak load memory equals the resident footprint (verified at full scale, regression-tested). This also cuts the default single-channel load peak by ~30 GB.
Retrieval upgrades shipped alongside
- Hybrid RRF default
rrf_k60 -> 10: wins every metric on the n=3000 sample (R@1 0.764 -> 0.771, R@10 0.991 -> 0.999). - Opt-in cross-encoder rerank (
MATHLAS_RERANK=1, Qwen3-Reranker-0.6B): RRF-blended with the first-stage order, never a replacement (replacement measurably HURTS R@1, and we say so); honest +1.7pp R@1 / +3.9pp R@10 cross-representation lift over dense. - Source-aware retrieval: opt-in
source_filter/source_weightsonsearch_existing_math(canonical keys arxiv / dolma / stacks / proofwiki / other). Excluding dolma recovers the pre-growth paper-level 13.6% on the 110-query bench; the default ranking stays byte-identical (test-pinned), because on the n=3000 self-recall down-weighting dolma is catastrophic for dolma-target queries.
Verification and serving features
- Proof checking, the repair loop:
verify_formal(statement, proof=...)kernel-checks a full Lean 4 proof with the real Lean 4.30 kernel; REFUTED returns the kernel's error verbatim; sorry/admit are REJECTED (Lean exits 0 on sorried proofs, so the naive check would fake-pass); toolchain-absent/timeout return honest UNDETERMINED. 20 pinned tests. - Formal-search cache:
search_formal_mathkeeps a 7-day on-disk cache and serves the last good Loogle/LeanSearch response when the service is down, clearly labeled with its age. - Quantized laptop tier:
MATHLAS_QUANTIZED=binary|int8serves the same 3.68M dense channel from memmapped sidecars (1.9 GB binary Hamming + exact rescore / 15.1 GB int8). Measured recall-lossless vs fp16 (R@1 0.6143 vs 0.6140, R@10 equal) at 2.4 s/query on 4 CPU threads. Caveat: queries must still be embedded by the index's own 8B encoder. - Hardened agent benchmark: 18 tasks (10 original, now saturated by frontier models and reported as such, + 8 verification-bottleneck hard tasks). Claude Fable 5 WITH mathlas 18/18 vs WITHOUT 15/18; the whole gap is the hard set (8/8 vs 5/8).
Correctness fixes (Codex review)
HashingEmbeddernow uses deterministic blake2b instead of Python's saltedhash(): a persisted hashing index built in one process is no longer silently broken when queried from another. The flagship Qwen3 index never touched this path.retrieve(mode=...)validates the mode and raises a clear ValueError instead of returning silently empty results on a typo.mathlas.__version__now reads the installed package metadata (single source of truth), with a test pinning it to pyproject.
Gate
- pytest: 111 passed, 1 skipped (opt-in network test), 0 failed.
- All §1 tiers re-run green: numeric 8/8 (0/3 false positives), sequence 8/8, formal 7/7, ramanujan 6/6 (0/2 fp), moat 15/15 + 6/6, tools 14/14.
Full numbers and caveats: README.md, RESULTS.md, docs/RETRIEVAL_UPGRADE_NOTES.md, docs/02_eval_vs_theoremsearch.md, docs/QUANTIZED_TIER.md.
Archive
This release is archived on Zenodo: version DOI 10.5281/zenodo.20633923 (concept DOI 10.5281/zenodo.20618603, as cited in CITATION.cff).
v1.1.2 — Apache-2.0 relicense
The mathlas code is now Apache-2.0 (was PolyForm-NC). Published corpus/index artifacts keep their own per-source terms (CC-BY/CC0). Also: official-registry namespace case fix and Zenodo DOI landed in 1.1.1.
pip install -U mathlas-mcp
v1.1.0 — formal search · reranker blend · dual-channel retrieval · 12-tool schema pass
mathlas-mcp 1.1.0 — the verification-first release.
New
search_formal_math: Loogle + LeanSearch proxy returning mathlib declarations with provenance — live eval Hit@1/5/10 = 0.64/0.96/0.96 over 25 index-verified queries.- Qwen3-Reranker-0.6B blend (opt-in): dense R@1 0.731→0.748, R@5 0.933→0.974. (Honest finding: replacement reranking hurts — relevance ≠ row identity — so it ships as an RRF blend.)
- Dual-channel retrieval (opt-in): per-doc max-sim over slogan + statement embeddings; 200k-slice validation R@1 0.833→0.981. Full-corpus statement channel building.
uvx mathlas-mcpone-liner; official-registryserver.json; 12-tool surface with outputSchema/title (funsearch collapsed to one action-enum tool).
Changed
- Index now 3,683,428 docs (1.34M + 2.34M slogan-embedded Dolma); honest cross-representation headline R@1 0.614 / R@10 0.832.
- RRF k default 60→10 (evaluated); BM25 top-k via argpartition (9× sort speedup, deterministic ties).
- README positioning rewritten: verification tiers + applicability checklist + the self-augmenting loop are the moat; comparison table vs TheoremSearch / LeanSearch / Wolfram MCP / sympy-mcp.
pip install -U mathlas-mcp · claude mcp add mathlas -- uvx mathlas-mcp
mathlas-mcp 1.0.1
Production release. Free, no-LLM, airtight-math MCP tool an AI uses — 1.635M-doc retrieval index, OEIS/Lean/PSLQ verification, self-augmenting web loop. pip install mathlas-mcp