mathlas v1.3.0: the 0.6B end-to-end laptop tier

Archerkattri released this 10 Jun 23:53 · 11 commits to main since this release

The 0.6B end-to-end laptop tier

This tier keeps the same 3,683,428-document corpus and served representation channel, re-embedded once with Qwen3-Embedding-0.6B (1024-d, row-aligned with the served meta), so the query encoder itself runs on a laptop CPU. This closes the honest caveat of the v1.1 quantized tier, whose queries still needed 8B-space vectors from a GPU box.

Serve it with one env var, MATHLAS_ENCODER=0.6b, which composes with the quantized sidecars (MATHLAS_QUANTIZED):

MATHLAS_ENCODER=0.6b MATHLAS_QUANTIZED=binary python -m mathlas.server

Measured (n=3000 body-to-slogan self-recall, full 3.68M index)

| dense config | R@1 | R@10 | end-to-end warm |
|---|---|---|---|
| 0.6B fp16 exact (tier baseline) | 0.544 | 0.745 | - |
| 0.6B binary top-1000 + int8 rescore | 0.545 | 0.745 | 0.67 s/query (4 CPU threads), 0.88 s (2) |
| 8B quantized tier (reference) | 0.614 | 0.832 | 2.4 s/query, search only (encoder needs a GPU box) |

  • Quantization is again recall-lossless within the tier; raw 1024-bit Hamming alone loses 4.5pp R@1, so the rescore stage is load-bearing at this dimension.
  • The honest price of CPU-sized query encoding is about 7-9pp recall vs the 8B tier (R@1 0.545 vs 0.614, R@10 0.745 vs 0.832). The dual-channel 8B configuration (0.965 / 0.999) remains the big-box quality ceiling.
  • The latency number is the first in the ladder to include query encoding, because this is the first tier whose encoder (~1.2 GB) fits on the target machine.
  • Dense-channel footprint: 0.47 GB binary sidecar + ~1.2 GB encoder (~1.7 GB resident). The 3.77 GB int8 rescore source is read at query time and is not held resident.
  • TheoremSearch-110 corpus-only probe: Hit@20 8.2% / 10.0% theorem/paper vs the 8B tier's 10.0% / 11.8% (both licensing-bounded floors).
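
For readers unfamiliar with the "binary top-1000 + int8 rescore" pattern, the following is a minimal NumPy sketch of the general technique, not the mathlas implementation. It assumes sign binarization and per-row symmetric int8 quantization, neither of which is confirmed by these notes, and every function name in it (`binarize`, `quantize_int8`, `hamming`, `search`, `sidecar_bytes`) is hypothetical. The `sidecar_bytes` helper only reproduces the footprint arithmetic quoted above from the document count and the 1024-d embedding size.

```python
import numpy as np

# Lookup table: number of set bits in each possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def binarize(x: np.ndarray) -> np.ndarray:
    """Sign-binarize float vectors and pack 8 bits per byte (assumed scheme)."""
    return np.packbits(x > 0, axis=-1)


def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-row symmetric int8 quantization (assumed scheme): codes and scales."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid dividing by zero on all-zero rows
    codes = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return codes, scale.astype(np.float32)


def hamming(query_bits: np.ndarray, doc_bits: np.ndarray) -> np.ndarray:
    """Hamming distance from one packed query to every packed document row."""
    xor = np.bitwise_xor(doc_bits, query_bits)
    return POPCOUNT[xor].sum(axis=1, dtype=np.int32)


def search(query, doc_bits, doc_codes, doc_scales, shortlist=1000, k=10):
    """Stage 1: Hamming shortlist on binary codes. Stage 2: int8 rescore."""
    q_bits = binarize(query[None, :])[0]
    dist = hamming(q_bits, doc_bits)
    shortlist = min(shortlist, dist.shape[0])
    # Unordered indices of the `shortlist` nearest rows by Hamming distance.
    cand = np.argpartition(dist, shortlist - 1)[:shortlist]
    # Dequantized dot product, computed for the shortlisted rows only.
    q = query.astype(np.float32)
    scores = (doc_codes[cand].astype(np.float32) @ q) * doc_scales[cand, 0]
    order = np.argsort(-scores)[:k]
    return cand[order]


def sidecar_bytes(n_docs: int, dim: int, bits_per_dim: int) -> int:
    """Raw size of a quantized matrix, ignoring any file header."""
    return n_docs * dim * bits_per_dim // 8
```

With 3,683,428 rows at 1024 dimensions, `sidecar_bytes(3_683_428, 1024, 1)` gives 471,478,784 bytes and `sidecar_bytes(3_683_428, 1024, 8)` gives 3,771,830,272 bytes, consistent with the 0.47 GB and 3.77 GB figures above. Because the rescore touches only the shortlisted rows, the int8 matrix can stay on disk (for example behind `np.memmap`) while the binary codes sit in memory.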

Full tables, protocol, and caveats: docs/QUANTIZED_TIER.md and RESULTS.md. Build: scripts/build_06b_index.py; eval: scripts/eval_06b_tier.py; serving mechanism pinned by tests/test_encoder_tier.py.

Gate: pytest 118 passed / 1 skipped.