OpenKB — Open LLM Knowledge Base

Scale to long documents • Reasoning-based retrieval • Native multi-modality • No Vector DB

📑 What is OpenKB

OpenKB (Open Knowledge Base) is an open-source command-line tool that uses LLMs to compile raw documents into a structured, interlinked, wiki-style knowledge base. It relies on PageIndex for vectorless retrieval over long documents.

The idea is based on a concept described by Andrej Karpathy: LLMs generate summaries, concept pages, and cross-references, all maintained automatically. Knowledge compounds over time instead of being re-derived on every query.

Why not traditional RAG?

Traditional RAG rediscovers knowledge from scratch on every query. Nothing accumulates. OpenKB compiles knowledge once into a persistent wiki, then keeps it current. Cross-references already exist. Contradictions are flagged. Synthesis reflects everything consumed.

Features

Broad format support — PDF, Word, Markdown, PowerPoint, HTML, Excel, text, and more via markitdown
Scale to long documents — Long and complex documents are handled via PageIndex tree indexing, enabling accurate, vectorless long-context retrieval
Native multi-modality — Retrieves and understands figures, tables, and images, not just text
Compiled Wiki — LLM manages and compiles your documents into summaries, concept pages, and cross-links, all kept in sync
Query — Ask questions (one-off) against your wiki. The LLM navigates your compiled knowledge to answer
Interactive Chat — Multi-turn conversations with persisted sessions you can resume across runs
Lint — Health checks find contradictions, gaps, orphans, and stale content
Watch mode — Drop files into raw/, wiki updates automatically
Obsidian compatible — Wiki is plain .md files with [[wikilinks]]. Open in Obsidian for graph view and browsing

🚀 Getting Started

Install

pip install openkb

Other install options

Latest from GitHub:

pip install git+https://github.com/VectifyAI/OpenKB.git

Install from source (editable, for development):

git clone https://github.com/VectifyAI/OpenKB.git
cd OpenKB
pip install -e .

Quick Start

# 1. Create a directory for your knowledge base
mkdir my-kb && cd my-kb

# 2. Initialize the knowledge base
openkb init

# 3. Add documents
openkb add paper.pdf
openkb add ~/papers/  # Add a whole directory

# 4. Ask a question
openkb query "What are the main findings?"

# 5. Or chat interactively
openkb chat

Set up your executor

OpenKB runs its LLM steps through local subprocess executors instead of an API-key runtime. Pick the provider you want during openkb init, or edit .openkb/config.yaml directly.

Supported executor styles include:

provider: claude with models like sonnet, opus, or claude-sonnet-4-6
provider: codex_app with models like gpt-5.4-mini or gpt-5.4
provider: codex with the same OpenAI-family model names
provider: ollama with local model names like llama3, mistral, or qwen2

Legacy prefixed model strings such as anthropic/claude-sonnet-4-6 remain readable in config files, but OpenKB normalizes them before calling the executor.
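
As a rough illustration of the normalization described above, the sketch below strips a vendor prefix from a legacy model string. The function name `normalize_model` and the rule "drop everything up to the first slash" are assumptions made for this example, not OpenKB's actual code.

```python
def normalize_model(model: str) -> str:
    """Return the model name without a legacy "<vendor>/" prefix.

    Assumption: legacy strings have the form "<vendor>/<model>" and only the
    part after the first slash is handed to the executor.
    """
    _vendor, sep, rest = model.strip().partition("/")
    # No slash means the string is already in the short form.
    return rest if sep else model.strip()


print(normalize_model("anthropic/claude-sonnet-4-6"))  # claude-sonnet-4-6
print(normalize_model("sonnet"))                       # sonnet
```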

🧩 How OpenKB Works

Architecture

raw/                              You drop files here
 │
 ├─ Short docs ──→ markitdown ──→ LLM reads full text
 │                                     │
 ├─ Long PDFs ──→ PageIndex ────→ LLM reads document trees
 │                                     │
 │                                     ▼
 │                         Wiki Compilation (using LLM)
 │                                     │
 ▼                                     ▼
wiki/
 ├── index.md            Knowledge base overview
 ├── log.md              Operations timeline
 ├── AGENTS.md           Wiki schema (LLM instructions)
 ├── sources/            Full-text conversions
 ├── summaries/          Per-document summaries
 ├── concepts/           Cross-document synthesis ← the good stuff
 ├── explorations/       Saved query results
 └── reports/            Lint reports

Short vs. Long Document Handling

	Short documents	Long documents (PDF ≥ 20 pages)
Convert	markitdown → Markdown	PageIndex → tree index + summaries
Images	Extracted inline (pymupdf)	Extracted by PageIndex
LLM reads	Full text	Document trees
Result	summary + concepts	summary + concepts

Short docs are read in full by the LLM. Long PDFs are indexed by PageIndex into a hierarchical tree with summaries. The LLM reads the tree instead of the full text, enabling better retrieval from long documents.
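
The routing rule in the table can be sketched as a small function. This is a minimal illustration that assumes the page count is already known; `choose_pipeline` and its return values are hypothetical names, and only the 20-page PDF threshold comes from the text above.

```python
from pathlib import Path
from typing import Optional

PAGEINDEX_THRESHOLD = 20  # matches the pageindex_threshold value shown in config.yaml


def choose_pipeline(path: str, page_count: Optional[int],
                    threshold: int = PAGEINDEX_THRESHOLD) -> str:
    """Pick a conversion route: PDFs at or above the threshold go to PageIndex."""
    is_pdf = Path(path).suffix.lower() == ".pdf"
    if is_pdf and page_count is not None and page_count >= threshold:
        return "pageindex"   # long PDF: the LLM reads the document tree
    return "markitdown"      # everything else: the LLM reads the full text


print(choose_pipeline("paper.pdf", 35))    # pageindex
print(choose_pipeline("notes.md", None))   # markitdown
```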

Knowledge Compilation

When you add a document, the LLM:

Generates a summary page
Reads existing concept pages
Creates or updates concepts with cross-document synthesis
Updates the index and log

A single source might touch 10-15 wiki pages. Knowledge accumulates: each document enriches the existing wiki rather than sitting in isolation.
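
To make the four steps concrete, here is a toy version that operates on an in-memory dictionary instead of files. Everything in it is an assumption for illustration: `compile_document` is a hypothetical name, the concept list is passed in by hand, and the places where an LLM would write text are replaced with trivial string operations.

```python
from typing import Dict, List


def compile_document(wiki: Dict[str, str], doc_name: str, text: str,
                     concepts: List[str]) -> List[str]:
    """Apply the four compilation steps to `wiki`; return the pages touched."""
    touched = []

    # 1. Summary page (an LLM would write this; here the text is truncated).
    summary_page = f"summaries/{doc_name}.md"
    wiki[summary_page] = text[:80]
    touched.append(summary_page)

    # 2 and 3. Read each existing concept page, then create or update it.
    for concept in concepts:
        page = f"concepts/{concept}.md"
        existing = wiki.get(page, "")
        wiki[page] = existing + f"- see [[{doc_name}]]\n"
        touched.append(page)

    # 4. Refresh the index and append to the log.
    content_pages = sorted(p for p in wiki if p not in ("index.md", "log.md"))
    wiki["index.md"] = "\n".join(content_pages)
    wiki["log.md"] = wiki.get("log.md", "") + f"added {doc_name}\n"
    touched += ["index.md", "log.md"]
    return touched


wiki: Dict[str, str] = {}
compile_document(wiki, "paper-a", "Tree indexes for retrieval.", ["retrieval"])
compile_document(wiki, "paper-b", "More on retrieval.", ["retrieval"])
print(wiki["concepts/retrieval.md"])
```

After the second call, the shared concept page references both documents, which is the sense in which each new source enriches existing pages.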

⚙️ Usage

Commands

Command	Description
`openkb init`	Initialize a new knowledge base (interactive)
`openkb add <file_or_dir>`	Add documents and compile to wiki
`openkb query "question"`	Ask a question over the knowledge base (use `--save` to save the answer to `wiki/explorations/`)
`openkb chat`	Start an interactive multi-turn chat (use `--resume`, `--list`, `--delete` to manage sessions)
`openkb watch`	Watch `raw/` and auto-compile new files
`openkb lint`	Run structural + knowledge health checks
`openkb list`	List indexed documents and concepts
`openkb status`	Show knowledge base stats

Interactive Chat

openkb chat opens an interactive chat session over your wiki knowledge base. Unlike the one-shot openkb query, each turn carries the conversation history, so you can dig into a topic without re-typing context.

openkb chat                       # start a new session
openkb chat --resume              # resume the most recent session
openkb chat --resume 20260411     # resume by id (unique prefix works)
openkb chat --list                # list all sessions
openkb chat --delete <id>         # delete a session
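
Resuming by a unique id prefix can be pictured as the lookup below. This is a sketch under assumptions: `resolve_session` is a hypothetical helper, and the timestamp-like session ids are invented for the example.

```python
from typing import List


def resolve_session(prefix: str, session_ids: List[str]) -> str:
    """Return the single session id that starts with `prefix`."""
    matches = [sid for sid in session_ids if sid.startswith(prefix)]
    if not matches:
        raise LookupError(f"no session matches {prefix!r}")
    if len(matches) > 1:
        raise LookupError(f"prefix {prefix!r} is ambiguous: {matches}")
    return matches[0]


# Hypothetical ids; the real id format may differ.
sessions = ["20260411-093012", "20260412-101500"]
print(resolve_session("20260411", sessions))  # 20260411-093012
```

A prefix such as `2026` would match both ids here, so the helper raises instead of guessing.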

Inside a chat, type / to access slash commands (Tab to complete):

/help — list available commands
/status — show knowledge base status
/list — list all documents
/add <path> — add a document or directory without leaving the chat
/save [name] — export the transcript to wiki/explorations/
/clear — start a fresh session (the current one stays on disk)
/lint — run knowledge base lint
/exit — exit (Ctrl-D also works)

Configuration

Settings are initialized by openkb init and stored in .openkb/config.yaml:

provider: claude             # Local executor provider
model: sonnet                # Model name for that executor
effort: medium               # Executor reasoning effort
language: en                 # Wiki output language
pageindex_threshold: 20      # PDF page-count threshold for PageIndex
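
For readers who want to inspect these settings from a script, the sketch below parses a flat config of this shape using only the standard library. It is an illustration, not OpenKB's loader: a real implementation would more likely use a YAML library, `parse_flat_config` is a hypothetical name, and treating the values shown above as defaults is an assumption.

```python
# Assumed defaults, copied from the example config above.
DEFAULTS = {
    "provider": "claude",
    "model": "sonnet",
    "effort": "medium",
    "language": "en",
    "pageindex_threshold": 20,
}


def parse_flat_config(text: str) -> dict:
    """Parse flat `key: value  # comment` lines and overlay them on DEFAULTS."""
    config = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Keep integers as integers so the threshold can be compared directly.
        config[key] = int(value) if value.isdigit() else value
    return config


cfg = parse_flat_config("provider: ollama\nmodel: llama3  # local model\n")
print(cfg["provider"], cfg["model"], cfg["pageindex_threshold"])  # ollama llama3 20
```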

Typical provider/model combinations:

Provider	Model example
Claude executor	`sonnet`, `opus`, `claude-sonnet-4-6`
Codex App executor	`gpt-5.4-mini`, `gpt-5.4`
Codex executor	`gpt-5.4-mini`, `gpt-5.4`
Ollama executor	`llama3`, `mistral`, `qwen2`

PageIndex Integration

Long documents are challenging for LLMs due to context limits, context rot, and summarization loss. PageIndex addresses this with vectorless, reasoning-based retrieval: it builds a hierarchical tree index that the LLM reasons over to locate the relevant sections.

PageIndex runs locally by default using the open-source version, with no external dependencies required.

Optional: Cloud Support

For large or complex PDFs, PageIndex Cloud can be used to access additional capabilities, including:

OCR support for scanned PDFs (via hosted VLMs)
Faster structure generation
Scalable indexing for large documents

Set PAGEINDEX_API_KEY in your .env to enable cloud features:

PAGEINDEX_API_KEY=your_pageindex_api_key
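
The local-versus-cloud switch can be thought of as a check on that variable. The sketch below assumes the .env file has already been loaded into the process environment; `pageindex_mode` is a hypothetical helper, not part of OpenKB or PageIndex.

```python
import os
from typing import Mapping, Optional


def pageindex_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "cloud" when PAGEINDEX_API_KEY is set and non-empty, else "local"."""
    env = os.environ if env is None else env
    key = env.get("PAGEINDEX_API_KEY", "").strip()
    return "cloud" if key else "local"


print(pageindex_mode({}))                                        # local
print(pageindex_mode({"PAGEINDEX_API_KEY": "your_pageindex_api_key"}))  # cloud
```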

AGENTS.md

The wiki/AGENTS.md file defines wiki structure and conventions. It's the LLM's instruction manual for maintaining the wiki. Customize it to change how your wiki is organized.

At runtime, the LLM reads AGENTS.md from disk, so your edits take effect immediately.

Using with Obsidian

OpenKB's wiki is a directory of Markdown files with [[wikilinks]]. Obsidian renders it natively.

Open wiki/ as an Obsidian vault
Browse summaries, concepts, and explorations
Use graph view to see knowledge connections
Use Obsidian Web Clipper to add web articles to raw/
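
Because the wiki is plain Markdown with [[wikilinks]], the link graph that Obsidian draws can also be recovered with a few lines of code. The sketch below extracts link targets and reports pages nothing links to, similar in spirit to an orphan check. The regular expression covers the common `[[page]]`, `[[page|alias]]`, and `[[page#heading]]` forms; the function names and the in-memory page dictionary are assumptions for the example.

```python
import re
from typing import Dict, Iterable, Set

# Captures the page name; an optional #heading and |alias are ignored.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def extract_links(markdown: str) -> Set[str]:
    """Return the set of page names referenced by wikilinks in `markdown`."""
    return {name.strip() for name in WIKILINK.findall(markdown)}


def find_orphans(pages: Dict[str, str], roots: Iterable[str] = ("index",)) -> Set[str]:
    """Pages that no other page links to, ignoring entry points in `roots`."""
    linked: Set[str] = set()
    for name, body in pages.items():
        linked |= extract_links(body) - {name}  # self-links do not count
    return set(pages) - linked - set(roots)


pages = {
    "index": "Start at [[retrieval]] or [[paper-a]].",
    "retrieval": "Discussed in [[paper-a|Paper A]] and [[paper-a#results]].",
    "paper-a": "Summary text.",
    "stale-note": "Nothing links here.",
}
print(find_orphans(pages))  # {'stale-note'}
```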

🧭 Learn More

Compared to Karpathy's Approach

	Karpathy's workflow	OpenKB
Short documents	LLM reads directly	markitdown → LLM reads
Long documents	Context limits, context rot	PageIndex tree index
Supported formats	Web clipper → .md	PDF, Word, PPT, Excel, HTML, text, CSV, .md
Wiki compilation	LLM agent	LLM agent (same)
Q&A	Query over wiki	Wiki + PageIndex retrieval

The Stack

PageIndex — Vectorless, reasoning-based document indexing and retrieval
markitdown — Universal file-to-markdown conversion
Local executor CLIs — claude, codex, or ollama-backed execution
Click — CLI framework
watchdog — Filesystem monitoring

Roadmap

Extend long document handling to non-PDF formats
Scale to large document collections with nested folder support
Hierarchical concept (topic) indexing for massive knowledge bases
Database-backed storage engine
Web UI for browsing and managing wikis

Contributing

Contributions are welcome! Please submit a pull request, or open an issue for bugs or feature requests. For larger changes, consider opening an issue first to discuss the approach.

License

Apache 2.0. See LICENSE.

Support Us

If you find OpenKB useful, please give us a star 🌟 — and check out PageIndex too!
