wiki42

Compile a markdown wiki into RAG-ready chunks. Drop-in replacement for Pinecone server-side embedding, but markdown-aware.

Open-source library from 42rows.com, the AI sales-intelligence platform that ships AI agents grounded on customer-specific wikis. We extracted the part that turns a wiki into vectors so any agent — yours, ours, anyone's — can consume it.

from wiki42 import compile_wiki

chunks = compile_wiki("./my-wiki/").chunks     # one chunk per page
# → load into Pinecone / FAISS / Chroma / Qdrant / Weaviate

In short: wiki42 understands [[wikilink]] graphs and YAML frontmatter, emits one vector per page (no chunk explosion, no source-bias amplification), and ships multilingual E5 embeddings out of the box. In the benchmarks below it sent 82–84% fewer tokens to the LLM than feeding the matched files directly, with identical or better answer quality depending on query type, and no vendor lock-in.

What this is (and isn't)

It is a parser + embedder for markdown wikis. Input: a directory, a .zip, or an https:// URL. Output: a list of {id, text, embedding, metadata} dicts that any vector store can consume.

It isn't a vector store, a retrieval framework, an orchestrator, or an MCP server. It is one focused step of a RAG pipeline. Pick the vector store, retrieval policy, and downstream LLM that fit your use case — wiki42 hands you portable chunks and gets out of the way. An MCP server on top of the library is on the roadmap (v0.3) but is not what this package ships today.

Why not just use Pinecone's `upsert_from_text_files`?

| | Pinecone server-side embedding | wiki42 |
|---|---|---|
| Chunking | token-based, cuts mid-sentence | one chunk per page (atomic unit of meaning) |
| YAML frontmatter | ignored | extracted to typed metadata (filterable) |
| `[[wikilink]]` graph | ignored | `metadata.wikilinks_out` edge list |
| Embedding model | `text-embedding-3-small` (English-leaning) | E5 multilingual large (IT/EN/DE/FR/ES/…) |
| Bias amplification | possible (split chunks repeat source bias) | none (full-page context, no LLM teacher) |
| Lock-in | vectors live in Pinecone | zero: Parquet/JSONL, use any vector DB |
| Cost | $70/mo serverless minimum | $0 self-host (or pay-per-compile Apify actor) |
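To make the frontmatter and wikilink rows concrete, here is a minimal stdlib-only sketch of what extracting both from a page could look like. It is not wiki42's actual parser (that lives in `_parser.py` and is not reproduced here): the name `parse_page` is hypothetical, and the frontmatter handling is deliberately naive, reading flat `key: value` pairs as strings rather than typed YAML.

```python
import re

# Matches [[target]], [[target#section]] and [[target|label]]; captures only the target.
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def parse_page(raw: str) -> dict:
    """Split a markdown page into naive frontmatter, body, and outgoing links."""
    meta, body = {}, raw
    if raw.startswith("---\n"):
        header, sep, rest = raw[4:].partition("\n---\n")
        if sep:  # only treat it as frontmatter if the closing fence exists
            body = rest
            for line in header.splitlines():
                key, colon, value = line.partition(":")
                if colon and key.strip():
                    meta[key.strip()] = value.strip()  # values stay strings here
    # Deduplicate links while keeping first-seen order.
    links = list(dict.fromkeys(m.strip() for m in WIKILINK_RE.findall(body)))
    return {"metadata": meta, "body": body, "wikilinks_out": links}


page = """---
title: Acme Logistics S.p.A.
kind: company
---
Acme sells to [[segments/retail-warehouse]] and ships [[products/wms-suite|WMS Suite]].
"""
parsed = parse_page(page)
print(parsed["metadata"], parsed["wikilinks_out"])
```

A token-based chunker sees none of this structure, which is why both rows read "ignored" in the Pinecone column.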

Quick start

pip install wiki42
wiki42 compile ./my-wiki/ --out chunks.parquet

Then drop into your vector store of choice (5 lines):

import polars as pl
from pinecone import Pinecone

chunks = pl.read_parquet("chunks.parquet")
index = Pinecone(api_key="...").Index("my-wiki")
index.upsert(vectors=[(c["id"], c["embedding"], c["metadata"]) for c in chunks.to_dicts()])

Same idea with FAISS, Chroma, Qdrant, Weaviate, or local NumPy. The library hands you portable data — what you do with it is your call.
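For the local NumPy case, a brute-force cosine search is usually enough at wiki scale. The sketch below assumes chunks have been loaded as plain dicts with an `embedding` list; `top_k` is a hypothetical helper, not part of wiki42, and the toy 3-dimensional vectors stand in for real 1024-dim E5 embeddings. In a real pipeline the query vector has to come from the same E5 model, and E5 models expect the query text to carry a `query: ` prefix (the counterpart of the `passage: ` prefix on chunk text).

```python
import numpy as np


def top_k(query_vec, chunks, k=3):
    """Return (id, score) for the k chunks most cosine-similar to query_vec."""
    matrix = np.asarray([c["embedding"] for c in chunks], dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # unit-length rows
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = matrix @ q  # cosine similarity, since both sides are normalized
    order = np.argsort(-scores)[:k]
    return [(chunks[i]["id"], float(scores[i])) for i in order]


# Toy vectors; real embeddings would come from the compiled chunks.
chunks = [
    {"id": "companies/acme-logistics#0", "embedding": [0.9, 0.1, 0.0]},
    {"id": "segments/retail-warehouse#0", "embedding": [0.1, 0.9, 0.0]},
    {"id": "products/wms-suite#0", "embedding": [0.0, 0.2, 0.9]},
]
hits = top_k([1.0, 0.0, 0.0], chunks, k=2)
print(hits)
```

With one vector per page, even a wiki of a few thousand pages fits in a single small matrix, so an approximate index is optional rather than required.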

Python library

from wiki42 import compile_wiki

# 1-page-1-chunk (default), Pinecone Inference embeddings (cloud)
result = compile_wiki("./wiki/", embedding_model="pinecone:multilingual-e5-large")

# Local sentence-transformers (offline, no API)
result = compile_wiki("./wiki/", embedding_model="intfloat/multilingual-e5-large")

# Skip embeddings, just parse
result = compile_wiki("./wiki/", embedding_model=None)

print(result.pages_parsed, result.chunks_generated, result.compile_time_s)
for chunk in result.chunks:
    print(chunk.id, chunk.metadata["title"], chunk.metadata["wikilinks_out"])
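Because `wikilinks_out` is an edge list, a backlink index is one pass away. The following is a sketch under assumptions: `build_backlinks` is a hypothetical helper (not a wiki42 function), and it takes chunks as plain dicts, as they would look after loading from Parquet or JSONL, rather than the attribute-style `Chunk` objects used above.

```python
from collections import defaultdict


def build_backlinks(chunks):
    """Invert wikilinks_out: map each target slug to the slugs that link to it."""
    backlinks = defaultdict(list)
    for chunk in chunks:
        source = chunk["metadata"]["slug"]
        for target in chunk["metadata"].get("wikilinks_out", []):
            backlinks[target].append(source)
    return dict(backlinks)


# Hand-written sample metadata, shaped like the chunk metadata in this README.
sample = [
    {"metadata": {"slug": "companies/acme-logistics",
                  "wikilinks_out": ["segments/retail-warehouse", "products/wms-suite"]}},
    {"metadata": {"slug": "companies/globex",
                  "wikilinks_out": ["segments/retail-warehouse"]}},
]
graph = build_backlinks(sample)
print(graph["segments/retail-warehouse"])
```

One possible use is retrieval expansion: after a vector hit on a page, pull in the pages that link to it.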

From a remote zip

wiki42 compile https://github.com/user/wiki/archive/main.zip --out chunks.parquet

GitHub-style archives are unwrapped automatically (the top <repo>-<branch>/ folder is detected and skipped).
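The unwrapping step can be pictured roughly as follows. This is a stdlib sketch of one way to detect a single wrapper folder, not wiki42's own code; `common_root` and `read_markdown` are hypothetical names, and the archive is built in memory so the example runs without network access.

```python
import io
import zipfile
from pathlib import PurePosixPath


def common_root(names):
    """Return the single top-level folder shared by all entries, or None."""
    roots = set()
    for name in names:
        parts = PurePosixPath(name).parts
        if not parts:
            continue
        if len(parts) == 1 and not name.endswith("/"):
            return None  # a file sits at the top level, so there is no wrapper
        roots.add(parts[0])
    return roots.pop() if len(roots) == 1 else None


def read_markdown(zip_bytes):
    """Map wrapper-relative paths to text for every .md file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        root = common_root(zf.namelist())
        prefix = f"{root}/" if root else ""
        return {
            name[len(prefix):]: zf.read(name).decode("utf-8")
            for name in zf.namelist()
            if name.endswith(".md")
        }


# Build a tiny GitHub-style archive (<repo>-<branch>/ wrapper) to exercise the sketch.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("wiki-main/index.md", "# Home")
    zf.writestr("wiki-main/companies/acme.md", "# Acme")
pages = read_markdown(buf.getvalue())
print(sorted(pages))
```

Stripping the wrapper keeps page slugs identical whether the wiki is compiled from a local directory or from a downloaded archive.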

Benchmark — honest, end-to-end

We ship two benches: a reproducible micro-bench you can run on the synthetic 4-page wiki in this repo, and aggregate numbers from an internal end-to-end bench done by 42rows.com on a 1,851-page customer wiki (raw queries and answers stay private; methodology and aggregates are below).

Bench A — easy queries (keyword in title)

5 questions whose keywords are literally in the file or segment name. Worst case for embeddings: a filesystem grep already finds the right document by name.

| | filesystem grep | wiki42 | Δ |
|---|---|---|---|
| Input tokens to LLM | 16,354 ± 987 | 2,569 ± 237 | −84% |
| End-to-end latency | 10.9 s | 10.7 s | tie |
| Answer quality | grounded with citations | grounded with citations | tie |

Bench B — hard queries (semantic, no keyword overlap)

5 questions written after reading the wiki content, phrased so the literal words do not appear in document titles. Retrieval has to work by meaning.

| | filesystem grep | wiki42 | Δ |
|---|---|---|---|
| Input tokens to LLM | 13,489 ± 2,916 | 2,481 ± 545 | −82% |
| End-to-end latency | 7.1 s | 9.0 s | grep 1.9 s faster |
| Answer quality (0–3 × 5, max 15) | 7 / 15 | 12 / 15 | wiki42 +71% |
| Hallucinated facts | 2 (invented numbers) | 0 | wiki42 wins |

The hard-query gain comes from two failure modes of pure-keyword retrieval: (1) when the question describes a concept by metaphor and not by literal title words, grep returns 20 unrelated documents and the LLM gives up; (2) when the answer is a specific number buried in one page and grep matches on the question's topic words instead, the LLM compensates by inventing a number. wiki42's page-level embeddings land on the right page directly in both cases.
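The Δ columns in both tables are plain relative changes against the grep baseline, and they can be checked from the reported means. The helper name `pct_change` below is ours, introduced only for this check.

```python
def pct_change(baseline, candidate):
    """Relative change of candidate vs baseline, in percent."""
    return (candidate - baseline) / baseline * 100


# Means copied from the Bench A and Bench B tables (grep baseline, wiki42 candidate).
bench_a_tokens = round(pct_change(16_354, 2_569))
bench_b_tokens = round(pct_change(13_489, 2_481))
bench_b_quality = round(pct_change(7, 12))

print(bench_a_tokens)   # -84
print(bench_b_tokens)   # -82
print(bench_b_quality)  # 71
```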

See benchmarks/results.md for the full methodology and benchmarks/bench.py for the reproducible micro-bench harness:

python benchmarks/bench.py --modes parse,local,cloud --runs 3

When wiki42 does not help

If your wiki is small (≤ 30 pages) and your agent already has filesystem access (e.g. Claude Code on your local repo), grep -rli is faster, cheaper, and good enough. wiki42 earns its keep when (a) you ship a wiki to consumers that do not have filesystem access (Claude Desktop, Cursor, Cline, hosted apps), (b) the queries are semantic rather than keyword-matched, or (c) the same wiki gets queried by many people or many times.

What's in the bundle

compile_wiki() returns a result whose chunks list holds Chunk objects. Serialized, each one looks like:

{
  "id":       "companies/acme-logistics#0",         // stable across recompiles
  "text":     "passage: Acme Logistics S.p.A. ...", // E5 expects this prefix
  "embedding": [0.012, -0.089, ...],                // 1024-dim float, or null
  "metadata": {
    "slug":           "companies/acme-logistics",
    "title":          "Acme Logistics S.p.A.",
    "kind":           "company",                    // any frontmatter field
    "confidence":     0.88,
    "last_touched":   "2026-05-10",
    "wikilinks_out":  ["segments/retail-warehouse", "products/wms-suite"],
    "section_path":   "",                           // populated if --split-h2
    "char_count":     2147
  }
}

Frontmatter fields land in metadata as-is — your vector DB's metadata filter can target any of them (kind=="decision" AND confidence>0.8).
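As an illustration of that filter, the sketch below evaluates `kind=="decision" AND confidence>0.8` locally over chunk dicts and, for comparison, writes the same predicate in Pinecone's MongoDB-style filter syntax. The helper `matches` and the sample chunks are hypothetical; other vector stores use their own filter dialects.

```python
def matches(metadata, kind, min_confidence):
    """Local equivalent of: kind == <kind> AND confidence > <min_confidence>."""
    confidence = metadata.get("confidence")
    return (
        metadata.get("kind") == kind
        and isinstance(confidence, (int, float))
        and confidence > min_confidence
    )


# The same predicate as a Pinecone metadata filter (top-level keys are ANDed),
# suitable for passing as filter=... to a query call.
pinecone_filter = {"kind": {"$eq": "decision"}, "confidence": {"$gt": 0.8}}

# Hand-written sample chunks; ids and values are made up for the example.
chunks = [
    {"id": "decisions/pricing-2026#0", "metadata": {"kind": "decision", "confidence": 0.91}},
    {"id": "decisions/hiring-freeze#0", "metadata": {"kind": "decision", "confidence": 0.55}},
    {"id": "companies/acme-logistics#0", "metadata": {"kind": "company", "confidence": 0.88}},
]
selected = [c["id"] for c in chunks if matches(c["metadata"], "decision", 0.8)]
print(selected)
```

Pages that lack a `confidence` field simply fail the predicate, which matches how a missing metadata key behaves in most vector stores' filters.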

Architecture

wiki42/
├── src/wiki42/                              ← the library
│   ├── __init__.py     — public surface
│   ├── compile.py      — compile_wiki()
│   ├── cli.py          — wiki42 compile
│   ├── _parser.py      — markdown + frontmatter + [[wikilinks]]
│   └── _embedder.py    — E5 cloud + local backend
├── tests/                                   ← pytest suite (no network)
├── examples/                                ← copy-paste snippets
├── benchmarks/                              ← bench harness + sample wiki
└── pyproject.toml

The library has no runtime dependency on any other package in the org. Drop the wheel into any project; everything else (Apify hosting, MCP server) is an optional integration.

Roadmap

v0.1 — library + CLI, page-level chunks, honest bench
v0.2 — ask_wiki(source, question) convenience wrapper
v0.3 — MCP server (Claude Desktop / Cursor / Cline) rebuilt on top of the library, 4-tool API
v1.0 — stable Python API, semantic versioning

License

MIT — see LICENSE.

Built and maintained by 42rows S.r.l. — sales intelligence with AI agents. Author: Mario Brosco — 42rows.com.

Want to support the project? Star ⭐ this repo, share it, or check out 42rows.com.
