wiki42

Compile a markdown wiki into RAG-ready chunks. Drop-in replacement for Pinecone server-side embedding, but markdown-aware.

Open-source library from 42rows.com, the AI sales-intelligence platform that ships AI agents grounded on customer-specific wikis. We extracted the part that turns a wiki into vectors so any agent — yours, ours, anyone's — can consume it.

from wiki42 import compile_wiki

chunks = compile_wiki("./my-wiki/").chunks     # one chunk per page
# → load into Pinecone / FAISS / Chroma / Qdrant / Weaviate

One function call. Smart about [[wikilink]] graphs and YAML frontmatter. One vector per page (no chunk explosion, no source-bias amplification). Multilingual E5 embeddings out of the box. In our internal benchmark, wiki42 sent 82–84% fewer input tokens to the LLM than feeding it the grep-matched files directly, with equal answer quality on keyword queries and better quality on semantic ones. No vendor lock-in.


What this is (and isn't)

It is a parser + embedder for markdown wikis. Input: a directory, a .zip, or an https:// URL. Output: a list of {id, text, embedding, metadata} dicts that any vector store can consume.

It isn't a vector store, a retrieval framework, an orchestrator, or an MCP server. It is one focused step of a RAG pipeline. Pick the vector store, retrieval policy, and downstream LLM that fit your use case — wiki42 hands you portable chunks and gets out of the way. An MCP server on top of the library is on the roadmap (v0.3) but is not what this package ships today.


Why not just use Pinecone's upsert_from_text_files?

| Feature | Pinecone server-side embedding | wiki42 |
|---|---|---|
| Chunking | token-based, cuts mid-sentence | one chunk per page (atomic unit of meaning) |
| YAML frontmatter | ignored | extracted to typed metadata (filterable) |
| [[wikilink]] graph | ignored | metadata.wikilinks_out edge list |
| Embedding model | text-embedding-3-small (English-leaning) | E5 multilingual large (IT/EN/DE/FR/ES/…) |
| Bias amplification | possible (split chunks repeat source bias) | none (full-page context, no LLM teacher) |
| Lock-in | vectors live in Pinecone | zero: parquet/jsonl, use any vector DB |
| Cost | $70/mo serverless minimum | $0 self-host (or pay-per-compile Apify actor) |
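
To make the frontmatter and wikilink rows concrete, here is a minimal stdlib-only sketch of what "markdown-aware" parsing means. It is not wiki42's actual parser: `parse_page` and `WIKILINK_RE` are hypothetical names, and the frontmatter is read as flat `key: value` lines, not full YAML.

```python
import re

# Matches [[target]], [[target|label]] and [[target#section]]; group 1 is the target.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def parse_page(raw: str) -> dict:
    """Split a markdown page into frontmatter, body, and outgoing wikilinks.

    Simplified on purpose: frontmatter values are kept as plain strings.
    """
    frontmatter = {}
    body = raw
    if raw.startswith("---\n"):
        end = raw.find("\n---", 3)  # closing fence of the frontmatter block
        if end != -1:
            header = raw[4:end]
            body = raw[end + 4:].lstrip("\n")
            for line in header.splitlines():
                key, sep, value = line.partition(":")
                if sep:
                    frontmatter[key.strip()] = value.strip()

    links = []
    for match in WIKILINK_RE.finditer(body):
        target = match.group(1).strip()
        if target not in links:  # keep first-seen order, drop duplicates
            links.append(target)

    return {"frontmatter": frontmatter, "body": body, "wikilinks_out": links}


page = """---
title: Acme Logistics S.p.A.
kind: company
---
Acme runs [[segments/retail-warehouse]] sites on [[products/wms-suite|WMS Suite]].
"""
parsed = parse_page(page)
print(parsed["frontmatter"])
print(parsed["wikilinks_out"])
```

A token-based chunker sees none of this structure, which is why the frontmatter and link targets never reach its metadata.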

Quick start

pip install wiki42
wiki42 compile ./my-wiki/ --out chunks.parquet

Then drop into your vector store of choice (5 lines):

import polars as pl
from pinecone import Pinecone

chunks = pl.read_parquet("chunks.parquet")
index = Pinecone(api_key="...").Index("my-wiki")
index.upsert(vectors=[(c["id"], c["embedding"], c["metadata"]) for c in chunks.to_dicts()])

Same idea with FAISS, Chroma, Qdrant, Weaviate, or local NumPy. The library hands you portable data — what you do with it is your call.
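
For small wikis you may not need a vector store at all. The sketch below is a pure-Python stand-in for the local NumPy option: brute-force cosine similarity over chunk dicts. The `cosine` and `top_k` helpers are hypothetical, the vectors are toy 3-dim values rather than real 1024-dim embeddings, and the query vector is assumed to come from the same embedding model as the chunks.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=3):
    """Return the k best (score, chunk_id) pairs, highest score first."""
    scored = [(cosine(query_vec, c["embedding"]), c["id"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: three pages with made-up 3-dim embeddings.
chunks = [
    {"id": "companies/acme-logistics#0", "embedding": [0.9, 0.1, 0.0]},
    {"id": "segments/retail-warehouse#0", "embedding": [0.1, 0.9, 0.0]},
    {"id": "products/wms-suite#0", "embedding": [0.0, 0.2, 0.9]},
]

results = top_k([1.0, 0.0, 0.0], chunks, k=2)
for score, chunk_id in results:
    print(f"{score:.3f}  {chunk_id}")
```

With one vector per page, a linear scan like this stays cheap into the low thousands of pages. Note that E5 models conventionally expect a `query: ` prefix on the question text, mirroring the `passage: ` prefix on chunk text.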

Python library

from wiki42 import compile_wiki

# 1-page-1-chunk (default), Pinecone Inference embeddings (cloud)
result = compile_wiki("./wiki/", embedding_model="pinecone:multilingual-e5-large")

# Local sentence-transformers (offline, no API)
result = compile_wiki("./wiki/", embedding_model="intfloat/multilingual-e5-large")

# Skip embeddings, just parse
result = compile_wiki("./wiki/", embedding_model=None)

print(result.pages_parsed, result.chunks_generated, result.compile_time_s)
for chunk in result.chunks:
    print(chunk.id, chunk.metadata["title"], chunk.metadata["wikilinks_out"])
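
If you want JSONL instead of parquet from the Python API, something along these lines should work. This is a sketch under assumptions: the `Chunk` dataclass below is a hypothetical stand-in with the four fields shown in this README, not the library's own class, and `to_jsonl` is an illustrative helper.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical stand-in mirroring the fields documented in this README."""
    id: str
    text: str
    embedding: Optional[list]  # None when compiled with embedding_model=None
    metadata: dict = field(default_factory=dict)


def to_jsonl(chunks) -> str:
    """Serialize chunks to JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(c), ensure_ascii=False) for c in chunks)


chunks = [
    Chunk(
        id="companies/acme-logistics#0",
        text="passage: Acme Logistics S.p.A. ...",
        embedding=None,
        metadata={
            "title": "Acme Logistics S.p.A.",
            "wikilinks_out": ["segments/retail-warehouse"],
        },
    )
]

jsonl = to_jsonl(chunks)
first = json.loads(jsonl.splitlines()[0])
print(first["id"], first["metadata"]["title"])
```

`ensure_ascii=False` keeps non-English titles readable in the output file, which matters for a multilingual wiki.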

From a remote zip

wiki42 compile https://github.com/user/wiki/archive/main.zip --out chunks.parquet

GitHub-style archives are unwrapped automatically (the top <repo>-<branch>/ folder is detected and skipped).
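
One plausible way to do that unwrapping, sketched with the standard library only. The helper names `common_top_folder` and `read_markdown_pages` are hypothetical and the library's real detection logic may differ; the idea is simply that if every entry shares one top-level folder, that folder is treated as a wrapper and stripped.

```python
import io
import zipfile
from typing import Optional


def common_top_folder(names) -> Optional[str]:
    """Return the single top-level folder shared by every entry, else None."""
    tops = set()
    for name in names:
        head, sep, _ = name.partition("/")
        if not sep:  # a file sits at the archive root: nothing to unwrap
            return None
        tops.add(head)
    return tops.pop() if len(tops) == 1 else None


def read_markdown_pages(zip_bytes: bytes) -> dict:
    """Map wiki-relative paths to page text, skipping a lone wrapper folder."""
    pages = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        wrapper = common_top_folder(names)
        prefix = f"{wrapper}/" if wrapper else ""
        for name in names:
            if name.endswith(".md"):
                pages[name[len(prefix):]] = archive.read(name).decode("utf-8")
    return pages


# Build a small GitHub-style archive in memory to exercise the helpers.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("wiki-main/index.md", "# Home")
    archive.writestr("wiki-main/companies/acme-logistics.md", "# Acme")

pages = read_markdown_pages(buffer.getvalue())
print(sorted(pages))
```

Stripping the wrapper keeps page slugs identical whether the wiki is compiled from a local checkout or from the downloaded archive.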


Benchmark — honest, end-to-end

We ship two benches: a reproducible micro-bench you can run on the synthetic 4-page wiki in this repo, and aggregate numbers from an internal end-to-end bench done by 42rows.com on a 1,851-page customer wiki (raw queries and answers stay private; methodology and aggregates are below).

Bench A — easy queries (keyword in title)

5 questions whose keywords are literally in the file or segment name. Worst case for embeddings: a filesystem grep already finds the right document by name.

| Metric | filesystem grep | wiki42 | Δ |
|---|---|---|---|
| Input tokens to LLM | 16,354 ± 987 | 2,569 ± 237 | −84% |
| End-to-end latency | 10.9 s | 10.7 s | tie |
| Answer quality | grounded with citations | grounded with citations | tie |

Bench B — hard queries (semantic, no keyword overlap)

5 questions written after reading the wiki content, phrased so the literal words do not appear in document titles. Retrieval has to work by meaning.

| Metric | filesystem grep | wiki42 | Δ |
|---|---|---|---|
| Input tokens to LLM | 13,489 ± 2,916 | 2,481 ± 545 | −82% |
| End-to-end latency | 7.1 s | 9.0 s | grep 1.9 s faster |
| Answer quality (0–3 × 5, max 15) | 7 / 15 | 12 / 15 | wiki42 +71% |
| Hallucinated facts | 2 (invented numbers) | 0 | wiki42 wins |
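
The Δ values in both benches are plain relative changes against the grep baseline, computed from the reported means. A quick check, using a hypothetical `pct_change` helper:

```python
def pct_change(baseline: float, value: float) -> int:
    """Relative change of value against baseline, as a rounded percentage."""
    return round((value - baseline) / baseline * 100)


easy_tokens = pct_change(16_354, 2_569)   # Bench A, input tokens
hard_tokens = pct_change(13_489, 2_481)   # Bench B, input tokens
hard_quality = pct_change(7, 12)          # Bench B, answer quality score

print(easy_tokens, hard_tokens, hard_quality)  # prints: -84 -82 71
```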

The hard-query gain comes from two failure modes of pure-keyword retrieval: (1) when the question describes a concept by metaphor and not by literal title words, grep returns 20 unrelated documents and the LLM gives up; (2) when the answer is a specific number buried in one page and grep matches on the question's topic words instead, the LLM compensates by inventing a number. wiki42's page-level embeddings land on the right page directly in both cases.

See benchmarks/results.md for the full methodology and benchmarks/bench.py for the reproducible micro-bench harness:

python benchmarks/bench.py --modes parse,local,cloud --runs 3

When wiki42 does not help

If your wiki is small (≤ 30 pages) and your agent already has filesystem access (e.g. Claude Code on your local repo), grep -rli is faster, cheaper, and good enough. wiki42 earns its keep when (a) you ship a wiki to consumers that do not have filesystem access (Claude Desktop, Cursor, Cline, hosted apps), (b) the queries are semantic rather than keyword-matched, or (c) the same wiki gets queried by many people or many times.


What's in the bundle

compile_wiki() returns Chunk objects shaped like:

{
  "id":       "companies/acme-logistics#0",         // stable across recompiles
  "text":     "passage: Acme Logistics S.p.A. ...", // E5 expects this prefix
  "embedding": [0.012, -0.089, ...],                // 1024-dim float, or null
  "metadata": {
    "slug":           "companies/acme-logistics",
    "title":          "Acme Logistics S.p.A.",
    "kind":           "company",                    // any frontmatter field
    "confidence":     0.88,
    "last_touched":   "2026-05-10",
    "wikilinks_out":  ["segments/retail-warehouse", "products/wms-suite"],
    "section_path":   "",                           // populated if --split-h2
    "char_count":     2147
  }
}
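
Because `wikilinks_out` is an edge list, the link graph can be rebuilt from the chunks alone. As a rough illustration (the `backlinks` helper is hypothetical, not part of the library), inverting the edges gives each page's incoming links, which can be handy for pulling in neighbouring pages at retrieval time:

```python
from collections import defaultdict


def backlinks(chunks):
    """Invert metadata.wikilinks_out into a {target_slug: [source_slug, ...]} map."""
    incoming = defaultdict(list)
    for chunk in chunks:
        source = chunk["metadata"]["slug"]
        for target in chunk["metadata"].get("wikilinks_out", []):
            incoming[target].append(source)
    return dict(incoming)


# Toy chunks carrying only the metadata fields this sketch needs.
chunks = [
    {"metadata": {"slug": "companies/acme-logistics",
                  "wikilinks_out": ["segments/retail-warehouse", "products/wms-suite"]}},
    {"metadata": {"slug": "segments/retail-warehouse",
                  "wikilinks_out": ["products/wms-suite"]}},
    {"metadata": {"slug": "products/wms-suite", "wikilinks_out": []}},
]

incoming = backlinks(chunks)
print(incoming["products/wms-suite"])
```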

Frontmatter fields land in metadata as-is — your vector DB's metadata filter can target any of them (kind=="decision" AND confidence>0.8).
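
As a sketch of that filter, here it is applied locally to chunk dicts, alongside what we believe is the equivalent expression in Pinecone's MongoDB-style filter syntax. The `matches` helper and the sample chunks are illustrative, and other vector stores use their own filter formats.

```python
def matches(metadata: dict, kind: str, min_confidence: float) -> bool:
    """True when the chunk's frontmatter satisfies kind == X AND confidence > Y."""
    return (
        metadata.get("kind") == kind
        and metadata.get("confidence", 0.0) > min_confidence
    )


# Assumed Pinecone equivalent; multiple top-level keys are combined with AND.
pinecone_filter = {"kind": {"$eq": "decision"}, "confidence": {"$gt": 0.8}}

# Toy chunks with made-up ids and frontmatter values.
chunks = [
    {"id": "decisions/pricing-2026#0", "metadata": {"kind": "decision", "confidence": 0.92}},
    {"id": "decisions/old-roadmap#0", "metadata": {"kind": "decision", "confidence": 0.55}},
    {"id": "companies/acme-logistics#0", "metadata": {"kind": "company", "confidence": 0.88}},
]

selected = [c["id"] for c in chunks if matches(c["metadata"], "decision", 0.8)]
print(selected)
```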


Architecture

wiki42/
├── src/wiki42/                              ← the library
│   ├── __init__.py     — public surface
│   ├── compile.py      — compile_wiki()
│   ├── cli.py          — wiki42 compile
│   ├── _parser.py      — markdown + frontmatter + [[wikilinks]]
│   └── _embedder.py    — E5 cloud + local backend
├── tests/                                   ← pytest suite (no network)
├── examples/                                ← copy-paste snippets
├── benchmarks/                              ← bench harness + sample wiki
└── pyproject.toml

The library has no runtime dependency on any other package in the org. Drop the wheel into any project; everything else (Apify hosting, MCP server) is an optional integration.


Roadmap

  • v0.1 — library + CLI, page-level chunks, honest bench
  • v0.2 — ask_wiki(source, question) convenience wrapper
  • v0.3 — MCP server (Claude Desktop / Cursor / Cline) rebuilt on top of the library, 4-tool API
  • v1.0 — stable Python API, semantic versioning

License

MIT — see LICENSE.

Built and maintained by 42rows S.r.l. (sales intelligence with AI agents). Author: Mario Brosco, 42rows.com.

Want to support the project? Star ⭐ this repo, share it, or check out 42rows.com.

About

Compile a markdown wiki into RAG-ready chunks for any vector database. One chunk per page, frontmatter as metadata, multilingual E5 embeddings. MIT. Python 3.10+.
