mathlas v1.3.0: the 0.6B end-to-end laptop tier
The 0.6B end-to-end laptop tier
The same 3,683,428-document corpus and served representation channel, re-embedded once with Qwen3-Embedding-0.6B (1024-d, row-aligned with the served meta), so the query encoder itself runs on a laptop CPU. This closes the honest caveat of the v1.1 quantized tier, whose queries still needed 8B-space vectors from a GPU box.
Serve it by setting one env var, MATHLAS_ENCODER (it composes with the quantized sidecars):

    MATHLAS_ENCODER=0.6b MATHLAS_QUANTIZED=binary python -m mathlas.server

Measured (n=3000 body-to-slogan self-recall, full 3.68M index)

| dense config | R@1 | R@10 | end-to-end warm |
|---|---|---|---|
| 0.6B fp16 exact (tier baseline) | 0.544 | 0.745 | - |
| 0.6B binary top-1000 + int8 rescore | 0.545 | 0.745 | 0.67 s/query (4 CPU threads), 0.88 s/query (2 CPU threads) |
| 8B quantized tier (reference) | 0.614 | 0.832 | 2.4 s/query, search only (encoder needs a GPU box) |
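
The "binary top-1000 + int8 rescore" row describes a two-stage search. The NumPy sketch below illustrates that general pattern on synthetic data; it is not the mathlas implementation. The function names, the sign-based binarization, and the global-scale int8 scheme are all assumptions made for illustration.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def binarize(x: np.ndarray) -> np.ndarray:
    """Sign-binarize float vectors, packing 8 dimensions per byte (assumed scheme)."""
    return np.packbits(x > 0, axis=-1)


def quantize_int8(x: np.ndarray) -> np.ndarray:
    """Map floats to int8 with one global scale (hypothetical; preserves ranking)."""
    scale = 127.0 / np.abs(x).max()
    return np.round(x * scale).astype(np.int8)


def two_stage_search(query, packed_docs, int8_docs, shortlist=1000, k=10):
    """Hamming shortlist over packed bits, then int8 dot-product rescore."""
    q_bits = binarize(query[None, :])[0]
    # Stage 1: Hamming distance is the popcount of the XOR, summed per row.
    xor = np.bitwise_xor(packed_docs, q_bits)
    dists = POPCOUNT[xor].sum(axis=1, dtype=np.int32)
    shortlist = min(shortlist, len(dists))
    cand = np.argpartition(dists, shortlist - 1)[:shortlist]
    # Stage 2: rescore only the shortlisted rows against the float query.
    scores = int8_docs[cand].astype(np.float32) @ query.astype(np.float32)
    order = np.argsort(-scores)[:k]
    return cand[order]


# Synthetic stand-in for the index: 2000 unit-norm 1024-d vectors.
rng = np.random.default_rng(0)
docs = rng.standard_normal((2000, 1024)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
packed = binarize(docs)          # 128 bytes per document
int8 = quantize_int8(docs)       # 1024 bytes per document

# A query that is a lightly perturbed copy of document 42.
query = docs[42] + 0.01 * rng.standard_normal(1024).astype(np.float32)
top = two_stage_search(query, packed, int8, shortlist=100, k=10)
print(top[:3])
```

In this shape, the first stage touches only the compact bit vectors, and the second stage reads int8 rows for the shortlist alone, which is why the rescore source does not need to be resident in memory.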
- Quantization is again recall-lossless within the tier; raw 1024-bit Hamming alone loses 4.5pp R@1, so the rescore stage is load-bearing at this dimension.
- The honest price of CPU-sized query encoding is about 7-9pp recall vs the 8B tier. The dual-channel 8B configuration (0.965 / 0.999) remains the big-box quality ceiling.
- The latency number is the first in the ladder to include query encoding, because this is the first tier whose encoder (~1.2 GB) fits on the target machine.
- Dense-channel footprint: 0.47 GB binary sidecar + ~1.2 GB encoder (~1.7 GB resident). The 3.77 GB int8 rescore source is not resident; it is read from at query time.
- TheoremSearch-110 corpus-only probe: Hit@20 of 8.2% (theorem) / 10.0% (paper) vs the 8B tier's 10.0% / 11.8% (both are licensing-bounded floors).
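
The sidecar sizes in the footprint bullet follow from the corpus size and the embedding dimension. A quick check, assuming decimal gigabytes, 1 bit per dimension for the binary sidecar, 1 byte per dimension for the int8 source, and no file headers or padding (the variable names are illustrative):

```python
N_DOCS = 3_683_428   # corpus size stated above
DIM = 1024           # Qwen3-Embedding-0.6B output dimension stated above
GB = 1e9             # decimal gigabytes (assumption)

binary_bytes = N_DOCS * DIM // 8   # 1 bit per dimension
int8_bytes = N_DOCS * DIM          # 1 byte per dimension
shortlist_bytes = 1000 * DIM       # int8 rows covered by a top-1000 shortlist

print(f"binary sidecar: {binary_bytes / GB:.2f} GB")
print(f"int8 rescore source: {int8_bytes / GB:.2f} GB")
print(f"int8 rows per top-1000 shortlist: {shortlist_bytes / 1e6:.2f} MB")
```

Under these assumptions, the int8 rows for a 1000-candidate shortlist total about 1 MB, which is consistent with keeping the 3.77 GB source off the resident footprint.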
Full tables, protocol, and caveats: docs/QUANTIZED_TIER.md and RESULTS.md. Build: scripts/build_06b_index.py; eval: scripts/eval_06b_tier.py; serving mechanism pinned by tests/test_encoder_tier.py.
Gate: pytest 118 passed / 1 skipped.