Compile a markdown wiki into RAG-ready chunks. Drop-in replacement for Pinecone server-side embedding, but markdown-aware.
Open-source library from 42rows.com, the AI sales-intelligence platform that ships AI agents grounded on customer-specific wikis. We extracted the part that turns a wiki into vectors so any agent — yours, ours, anyone's — can consume it.
from wiki42 import compile_wiki
chunks = compile_wiki("./my-wiki/") # one chunk per page
# → load into Pinecone / FAISS / Chroma / Qdrant / Weaviate

One sentence. Smart about [[wikilink]] graphs and YAML frontmatter. One vector per page (no chunk explosion, no source-bias amplification). Multilingual E5 embeddings out of the box. −84% tokens to your LLM vs feeding the matched files directly, identical or better answer quality depending on query type, no vendor lock-in.
It is a parser + embedder for markdown wikis. Input: a directory, a .zip, or an https:// URL. Output: a list of {id, text, embedding, metadata} dicts that any vector store can consume.
It isn't a vector store, a retrieval framework, an orchestrator, or an MCP server. It is one focused step of a RAG pipeline. Pick the vector store, retrieval policy, and downstream LLM that fit your use case — wiki42 hands you portable chunks and gets out of the way. An MCP server on top of the library is on the roadmap (v0.3) but is not what this package ships today.
| | Pinecone server-side embedding | wiki42 |
|---|---|---|
| Chunking | token-based, cuts mid-sentence | one chunk per page (atomic unit of meaning) |
| YAML frontmatter | ignored | extracted to typed metadata (filterable) |
| [[wikilink]] graph | ignored | metadata.wikilinks_out edge list |
| Embedding model | text-embedding-3-small (English-leaning) | E5 multilingual large (IT/EN/DE/FR/ES/…) |
| Bias amplification | possible (split chunks repeat source bias) | none (full-page context, no LLM teacher) |
| Lock-in | vectors live in Pinecone | zero — parquet/jsonl, use any vector DB |
| Cost | $70/mo serverless minimum | $0 self-host (or pay-per-compile Apify actor) |
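To make the frontmatter and wikilink rows concrete, here is a minimal sketch of what page-level extraction could look like. It is not wiki42's parser: `split_frontmatter`, `coerce`, and `extract_wikilinks` are hypothetical names, and the frontmatter handling only covers flat `key: value` lines (a real implementation would presumably use a full YAML parser).

```python
from __future__ import annotations

import re

# Matches [[target]], [[target|alias]] and [[target#anchor]]; captures the target only.
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def coerce(value: str):
    """Turn a scalar string into bool/int/float where it parses cleanly, else keep the string."""
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value.strip("\"'")


def split_frontmatter(markdown: str) -> tuple[dict, str]:
    """Split a page into (metadata, body). Assumption: flat `key: value` frontmatter only."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, markdown  # no closing delimiter: treat the whole page as body
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = coerce(value.strip())
    return meta, "\n".join(lines[end + 1:])


def extract_wikilinks(body: str) -> list[str]:
    """Return unique link targets in order of first appearance."""
    seen: dict[str, None] = {}
    for match in WIKILINK_RE.finditer(body):
        seen.setdefault(match.group(1).strip(), None)
    return list(seen)


page = "\n".join([
    "---",
    "title: Acme Logistics S.p.A.",
    "kind: company",
    "confidence: 0.88",
    "---",
    "Runs [[segments/retail-warehouse]] on [[products/wms-suite|WMS Suite]].",
])
meta, body = split_frontmatter(page)
links = extract_wikilinks(body)
```

Typed values matter for the "filterable" claim: a `confidence` stored as a float can be compared numerically by a vector store's metadata filter, while a string cannot.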
pip install wiki42
wiki42 compile ./my-wiki/ --out chunks.parquet

Then drop into your vector store of choice (5 lines):
import polars as pl
from pinecone import Pinecone
chunks = pl.read_parquet("chunks.parquet")
index = Pinecone(api_key="...").Index("my-wiki")
index.upsert(vectors=[(c["id"], c["embedding"], c["metadata"]) for c in chunks.to_dicts()])

Same idea with FAISS, Chroma, Qdrant, Weaviate, or local NumPy. The library hands you portable data — what you do with it is your call.
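For the no-database case, a brute-force cosine search over the chunk dicts is enough to get started. The sketch below is written in plain Python so it runs without extra dependencies (the same logic vectorizes naturally with NumPy); `cosine` and `top_k` are illustrative helper names, not part of wiki42, and the 3-dimensional vectors are toy stand-ins for real 1024-dimensional embeddings. With E5 models, the query text is normally embedded with a `query: ` prefix, mirroring the `passage: ` prefix on the chunk side.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(chunks, query_vec, k=3):
    """Return the k (id, score) pairs most similar to query_vec, best first."""
    scored = [
        (c["id"], cosine(c["embedding"], query_vec))
        for c in chunks
        if c.get("embedding") is not None  # chunks compiled without embeddings are skipped
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy data: ids follow the "<slug>#0" pattern, vectors are made up for illustration.
toy_chunks = [
    {"id": "companies/acme-logistics#0", "embedding": [1.0, 0.0, 0.0]},
    {"id": "segments/retail-warehouse#0", "embedding": [0.0, 1.0, 0.0]},
    {"id": "products/wms-suite#0", "embedding": [0.6, 0.8, 0.0]},
    {"id": "notes/unembedded#0", "embedding": None},
]
hits = top_k(toy_chunks, [0.9, 0.1, 0.0], k=2)
```

A linear scan like this is usually fine for a few thousand pages; beyond that, an approximate index such as FAISS is the more typical choice.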
from wiki42 import compile_wiki
# 1-page-1-chunk (default), Pinecone Inference embeddings (cloud)
result = compile_wiki("./wiki/", embedding_model="pinecone:multilingual-e5-large")
# Local sentence-transformers (offline, no API)
result = compile_wiki("./wiki/", embedding_model="intfloat/multilingual-e5-large")
# Skip embeddings, just parse
result = compile_wiki("./wiki/", embedding_model=None)
print(result.pages_parsed, result.chunks_generated, result.compile_time_s)
for chunk in result.chunks:
    print(chunk.id, chunk.metadata["title"], chunk.metadata["wikilinks_out"])

wiki42 compile https://github.com/user/wiki/archive/main.zip --out chunks.parquet

GitHub-style archives are unwrapped automatically (the top <repo>-<branch>/ folder is detected and skipped).
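How wiki42 detects the wrapper folder internally is not documented here. One plausible approach, sketched below under that assumption, is to strip the first path component only when every file in the archive shares it. `strip_single_root` is a hypothetical helper, and the archive is built in memory purely for the demonstration.

```python
import io
import zipfile
from pathlib import PurePosixPath


def strip_single_root(names):
    """Map each file name to its path without the shared top folder, if exactly one exists."""
    files = [n for n in names if not n.endswith("/")]  # ignore directory entries
    parts = [PurePosixPath(n).parts for n in files]
    roots = {p[0] for p in parts}
    has_top_level_file = any(len(p) == 1 for p in parts)
    if len(roots) != 1 or has_top_level_file:
        return {n: n for n in files}  # no single wrapper folder: leave paths untouched
    return {n: str(PurePosixPath(*p[1:])) for n, p in zip(files, parts)}


# Build a GitHub-style archive in memory: everything sits under "<repo>-<branch>/".
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("wiki-main/index.md", "# Home")
    archive.writestr("wiki-main/companies/acme.md", "# Acme")

with zipfile.ZipFile(buffer) as archive:
    mapping = strip_single_root(archive.namelist())
```

Stripping the wrapper keeps slugs stable: `companies/acme` stays the same whether the wiki is compiled from a local checkout or from a downloaded archive.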
We ship two benches: a reproducible micro-bench you can run on the synthetic 4-page wiki in this repo, and aggregate numbers from an internal end-to-end bench done by 42rows.com on a 1,851-page customer wiki (raw queries and answers stay private; methodology and aggregates are below).
5 questions whose keywords are literally in the file or segment name.
Worst case for embeddings: a filesystem grep already finds the right
document by name.
| | filesystem grep | wiki42 | Δ |
|---|---|---|---|
| Input tokens to LLM | 16,354 ± 987 | 2,569 ± 237 | −84% |
| End-to-end latency | 10.9 s | 10.7 s | tie |
| Answer quality | grounded with citations | grounded with citations | tie |
5 questions written after reading the wiki content, phrased so the literal words do not appear in document titles. Retrieval has to work by meaning.
| | filesystem grep | wiki42 | Δ |
|---|---|---|---|
| Input tokens to LLM | 13,489 ± 2,916 | 2,481 ± 545 | −82% |
| End-to-end latency | 7.1 s | 9.0 s | grep 1.9 s faster |
| Answer quality (0–3 × 5, max 15) | 7 / 15 | 12 / 15 | wiki42 +71% |
| Hallucinated facts | 2 (invented numbers) | 0 | wiki42 wins |
The hard-query gain comes from two failure modes of pure-keyword retrieval: (1) when the question describes a concept by metaphor and not by literal title words, grep returns 20 unrelated documents and the LLM gives up; (2) when the answer is a specific number buried in one page and grep matches on the question's topic words instead, the LLM compensates by inventing a number. wiki42's page-level embeddings land on the right page directly in both cases.
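The Δ values in both tables are consistent with a plain relative change against the grep baseline, rounded to the nearest percent. The snippet below reproduces them from the reported means; `pct_change` is an illustrative helper, and the rounding convention is an assumption.

```python
def pct_change(baseline, value):
    """Relative change of value versus baseline, in whole percent."""
    return round((value - baseline) / baseline * 100)


easy_tokens = pct_change(16_354, 2_569)   # input tokens, keyword-matched queries
hard_tokens = pct_change(13_489, 2_481)   # input tokens, semantic queries
hard_quality = pct_change(7, 12)          # answer quality score, semantic queries

print(easy_tokens, hard_tokens, hard_quality)  # -84 -82 71
```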
See benchmarks/results.md for the full
methodology and benchmarks/bench.py for the
reproducible micro-bench harness:
python benchmarks/bench.py --modes parse,local,cloud --runs 3

If your wiki is small (≤ 30 pages) and your agent already has filesystem
access (e.g. Claude Code on your local repo), grep -rli is faster,
cheaper, and good enough. wiki42 earns its keep when (a) you ship a
wiki to consumers that do not have filesystem access (Claude Desktop,
Cursor, Cline, hosted apps), (b) the queries are semantic rather than
keyword-matched, or (c) the same wiki gets queried by many people or
many times.
compile_wiki() returns Chunk objects; an example of their shape is shown at the end of this document.
Frontmatter fields land in metadata as-is — your vector DB's metadata filter can target any of them (kind=="decision" AND confidence>0.8).
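Before wiring up a vector store, the same predicate can be checked locally over the chunk dicts. This is a minimal sketch: `matches` is a hypothetical helper, the sample chunks are invented, and `pinecone_style_filter` shows how the condition would look in Pinecone's MongoDB-style filter syntax (other stores use their own).

```python
def matches(metadata, kind, min_confidence):
    """True when the page has the given kind and a numeric confidence above the threshold."""
    confidence = metadata.get("confidence")
    return (
        metadata.get("kind") == kind
        and isinstance(confidence, (int, float))  # pages without the field never match
        and confidence > min_confidence
    )


# Invented sample data; only the metadata fields used by the filter are included.
sample_chunks = [
    {"id": "decisions/pricing-2026#0", "metadata": {"kind": "decision", "confidence": 0.9}},
    {"id": "decisions/rebrand#0", "metadata": {"kind": "decision", "confidence": 0.5}},
    {"id": "companies/acme-logistics#0", "metadata": {"kind": "company", "confidence": 0.88}},
    {"id": "decisions/untitled#0", "metadata": {"kind": "decision"}},
]
confident_decisions = [
    c["id"] for c in sample_chunks if matches(c["metadata"], "decision", 0.8)
]

# The equivalent condition expressed as a Pinecone metadata filter dict.
pinecone_style_filter = {"kind": {"$eq": "decision"}, "confidence": {"$gt": 0.8}}
```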
wiki42/
├── src/wiki42/ ← the library
│ ├── __init__.py — public surface
│ ├── compile.py — compile_wiki()
│ ├── cli.py — wiki42 compile
│ ├── _parser.py — markdown + frontmatter + [[wikilinks]]
│ └── _embedder.py — E5 cloud + local backend
├── tests/ ← pytest suite (no network)
├── examples/ ← copy-paste snippets
├── benchmarks/ ← bench harness + sample wiki
└── pyproject.toml
The library has no runtime dependency on any other package in the org. Drop the wheel into any project; everything else (Apify hosting, MCP server) is an optional integration.
- v0.1 — library + CLI, page-level chunks, honest bench
- v0.2 — ask_wiki(source, question) convenience wrapper
- v0.3 — MCP server (Claude Desktop / Cursor / Cline) rebuilt on top of the library, 4-tool API
- v1.0 — stable Python API, semantic versioning
MIT — see LICENSE.
Built and maintained by 42rows S.r.l. — sales intelligence with AI agents. Author: Mario Brosco — 42rows.com.
Want to support the project? Star ⭐ this repo, share it, or check out 42rows.com.
Example Chunk as returned by compile_wiki():

{
  "id": "companies/acme-logistics#0",            // stable across recompiles
  "text": "passage: Acme Logistics S.p.A. ...",  // E5 expects this prefix
  "embedding": [0.012, -0.089, ...],             // 1024-dim float, or null
  "metadata": {
    "slug": "companies/acme-logistics",
    "title": "Acme Logistics S.p.A.",
    "kind": "company",                           // any frontmatter field
    "confidence": 0.88,
    "last_touched": "2026-05-10",
    "wikilinks_out": ["segments/retail-warehouse", "products/wms-suite"],
    "section_path": "",                          // populated if --split-h2
    "char_count": 2147
  }
}
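Because `wikilinks_out` is an outgoing edge list, the reverse direction (which pages link to a given page) can be derived after compilation. The sketch below assumes chunks are plain dicts with the `slug` and `wikilinks_out` metadata fields shown above; `backlinks` is a hypothetical helper, not a wiki42 function.

```python
from collections import defaultdict


def backlinks(chunks):
    """Invert metadata.wikilinks_out: map each target slug to the slugs that link to it."""
    incoming = defaultdict(list)
    for chunk in chunks:
        source = chunk["metadata"]["slug"]
        for target in chunk["metadata"].get("wikilinks_out", []):
            if source not in incoming[target]:  # keep each source once per target
                incoming[target].append(source)
    return dict(incoming)


# Invented three-page wiki for illustration.
graph_chunks = [
    {"metadata": {"slug": "companies/acme-logistics",
                  "wikilinks_out": ["segments/retail-warehouse", "products/wms-suite"]}},
    {"metadata": {"slug": "segments/retail-warehouse",
                  "wikilinks_out": ["products/wms-suite"]}},
    {"metadata": {"slug": "products/wms-suite", "wikilinks_out": []}},
]
linked_from = backlinks(graph_chunks)
```

One use for this is retrieval expansion: after a vector hit on a page, its direct neighbors in either direction are natural candidates to add to the context.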