A CLI tool that ingests PDF documents into a structured, LLM-maintained Markdown wiki. An LLM reads each document, extracts key concepts, and produces interlinked wiki pages with source citations — building a knowledge base incrementally.
PDF ─→ Marker (OCR + layout) ─→ structured blocks ─→ LLM prompt ─→ wiki pages
- Parse — Marker extracts text with OCR, layout detection, and page numbers
- Prompt — The parsed content is combined with the wiki schema and current index into a single LLM prompt
- Generate — The LLM (Ollama) produces wiki pages with YAML frontmatter, cross-links (`[[page-name]]`), and source citations (`source.pdf, p.3`)
- Store — Pages are written to a flat Markdown wiki with an auto-maintained `index.md` and an append-only `log.md`
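
As a rough illustration of the Prompt step, the sketch below joins a schema, the current index, and page-tagged content into one string. The `Block` dataclass, the `build_prompt` function, and the section headings inside the prompt are hypothetical names chosen for this example; the real prompt layout in `ingestion.py` may differ.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One parsed chunk of text with the page it came from (hypothetical shape)."""

    page: int
    text: str


def build_prompt(schema: str, index: str, source: str, blocks: list[Block]) -> str:
    """Join the schema, the current index, and page-tagged content into one prompt."""
    # Tag every block with its source and page so the model can cite it.
    content = "\n".join(f"({source}, p.{b.page}) {b.text}" for b in blocks)
    return (
        f"{schema}\n\n"
        f"## Current index\n{index}\n\n"
        f"## Document: {source}\n{content}\n"
    )


blocks = [Block(page=3, text="Transformers use self-attention.")]
prompt = build_prompt(
    schema="Write pages with YAML frontmatter.",
    index="- [[attention]]",
    source="source.pdf",
    blocks=blocks,
)
print(prompt)
```

Passing the current index along with the document is what lets the model link to pages that already exist instead of creating duplicates.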
- Python 3.12+
- uv
- Ollama running locally (recommended: native install on macOS for Apple Silicon GPU support)
# Install dependencies
uv sync
# Start Ollama and pull the model
ollama serve &
ollama pull qwen2.5:7b
# Or use Docker (note: requires sufficient memory allocation)
docker compose up -d

Configuration lives in `config.yml`:
wiki:
  dir: wiki
ollama:
  model: qwen2.5:7b
  url: http://localhost:11434
  timeout: 300
  num_ctx: 16384

Environment variables override config values (precedence: CLI flag > env var > config.yml > defaults):
| Variable | Default | Description |
|---|---|---|
| `LLM_WIKI_OLLAMA_MODEL` | `qwen2.5:7b` | Ollama model name |
| `LLM_WIKI_OLLAMA_URL` | `http://localhost:11434` | Ollama API base URL |
| `LLM_WIKI_DIR` | `wiki` | Wiki output directory |
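
The precedence order (CLI flag > env var > config.yml > defaults) can be pictured as a layered lookup. This is a minimal sketch, not the actual logic in `config.py`: the `resolve` function and the flat keys `model`, `url`, and `wiki_dir` are assumptions made for illustration, while the environment variable names and defaults come from the table above.

```python
import os
from collections import ChainMap
from collections.abc import Mapping

# Defaults and variable names taken from the table; the flat keys are hypothetical.
DEFAULTS = {"model": "qwen2.5:7b", "url": "http://localhost:11434", "wiki_dir": "wiki"}
ENV_VARS = {
    "model": "LLM_WIKI_OLLAMA_MODEL",
    "url": "LLM_WIKI_OLLAMA_URL",
    "wiki_dir": "LLM_WIKI_DIR",
}


def resolve(cli: dict, file_cfg: dict, environ: Mapping[str, str] = os.environ) -> dict:
    """Merge settings so that CLI flags beat env vars, which beat the file, which beats defaults."""
    # Only environment variables that are actually set take part in the lookup.
    env = {key: environ[name] for key, name in ENV_VARS.items() if name in environ}
    # A flag that was not passed arrives as None and must not mask lower layers.
    flags = {key: value for key, value in cli.items() if value is not None}
    # ChainMap returns the value from the first mapping that contains the key.
    return dict(ChainMap(flags, env, file_cfg, DEFAULTS))


settings = resolve(
    cli={"model": "qwen2.5:3b", "wiki_dir": None},
    file_cfg={"model": "qwen2.5:14b", "wiki_dir": "notes"},
    environ={"LLM_WIKI_OLLAMA_URL": "http://127.0.0.1:11434"},
)
print(settings)
```

Here the model comes from the CLI flag, the URL from the environment, and the wiki directory from the file, because no higher layer sets it.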
# Ingest a single PDF
uv run llm-wiki ingest path/to/document.pdf
# Ingest all PDFs in a directory
uv run llm-wiki ingest-all path/to/pdfs/
# Show wiki status
uv run llm-wiki status

Config values can also be overridden via CLI flags:
uv run llm-wiki ingest doc.pdf --model qwen2.5:3b --wiki-dir output/wiki/
├── _schema.md # LLM instructions (page format, rules, output delimiters)
├── index.md # Auto-maintained page catalog
├── log.md # Append-only ingestion log
├── source-summary.md # One per ingested document
├── concept-page.md # Topics spanning multiple sources
└── entity-page.md # Named things (models, standards, locations)
Pages use YAML frontmatter with source tracking and [[wiki-links]] for cross-references.
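
To make the page format concrete, the sketch below splits a page into frontmatter and body and collects its `[[wiki-links]]`. The sample page and its frontmatter fields (`title`, `sources`) are illustrative assumptions; the authoritative field list lives in `_schema.md`. The helper `split_frontmatter` is likewise a hypothetical name, not a function from `wiki_store.py`.

```python
import re

# Hypothetical page; real frontmatter fields are defined by _schema.md.
PAGE = """---
title: Self-Attention
sources:
  - source.pdf
---
Self-attention is the core of the [[transformer]] (source.pdf, p.3).
See also [[positional-encoding]].
"""

# Matches [[page-name]] and captures the name between the brackets.
WIKI_LINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def split_frontmatter(page: str) -> tuple[str, str]:
    """Return (frontmatter, body); frontmatter is empty if the page has none."""
    if not page.startswith("---\n"):
        return "", page
    frontmatter, sep, body = page[4:].partition("\n---\n")
    if not sep:  # opening delimiter without a closing one
        return "", page
    return frontmatter, body


frontmatter, body = split_frontmatter(PAGE)
links = WIKI_LINK.findall(body)
print(links)
```

A link checker built on this idea could compare `links` against the page names listed in `index.md` to find dangling references.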
src/llm_wiki/
├── cli.py # Typer CLI (ingest, ingest-all, status)
├── config.py # YAML config loading with env var overrides
├── parser.py # Marker PDF parsing + LLM text formatting
├── llm.py # LLM Protocol + OllamaLLM implementation
├── ingestion.py # Orchestrator: parse → prompt → generate → store
└── wiki_store.py # Flat-file wiki (read/write pages, index, log)
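
The split between an LLM Protocol and an Ollama implementation might look roughly like the sketch below. The class and method names (`LLM`, `generate`, `payload`, `EchoLLM`, `run`) are assumptions for illustration and may not match `llm.py`. The request shape follows Ollama's `/api/generate` endpoint, with the defaults taken from `config.yml`.

```python
import json
import urllib.request
from typing import Protocol


class LLM(Protocol):
    """Anything that turns a prompt into text (hypothetical interface)."""

    def generate(self, prompt: str) -> str: ...


class OllamaLLM:
    """Calls a local Ollama server over HTTP."""

    def __init__(self, model: str = "qwen2.5:7b", url: str = "http://localhost:11434",
                 timeout: int = 300, num_ctx: int = 16384) -> None:
        self.model, self.url, self.timeout, self.num_ctx = model, url, timeout, num_ctx

    def payload(self, prompt: str) -> dict:
        # stream=False asks Ollama for a single JSON object instead of chunks.
        return {"model": self.model, "prompt": prompt, "stream": False,
                "options": {"num_ctx": self.num_ctx}}

    def generate(self, prompt: str) -> str:
        request = urllib.request.Request(
            f"{self.url}/api/generate",
            data=json.dumps(self.payload(prompt)).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=self.timeout) as response:
            return json.loads(response.read())["response"]


class EchoLLM:
    """Stand-in for tests; satisfies the Protocol without a server."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


def run(llm: LLM, prompt: str) -> str:
    # Callers depend only on the Protocol, so any implementation can be swapped in.
    return llm.generate(prompt)


print(run(EchoLLM(), "hello"))
```

Because the orchestrator would depend only on the Protocol, unit tests can run against a stand-in such as `EchoLLM` without a running Ollama instance.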
# Run tests
uv run pytest
# Run only unit tests
uv run pytest -m unit
# Lint
uv run ruff check src/ tests/
# Format
uv run ruff format src/ tests/
# Type check
uv run ty check src/