Local RAG (Retrieval-Augmented Generation) CLI for searching a folder of PDF books or papers.
Key ideas:
- Base scenario is fully local usage — no cloud services. Ollama for embeddings and LLM inference, ChromaDB for vector storage.
- Keep things as simple as possible.
How it works:
your PDFs → pedro index → vector DB → pedro ask → answer
- You have a folder of PDF books
- `pedro index` reads them, splits them into chunks, embeds each chunk, and stores it in a local vector DB
- `pedro ask` embeds your question, finds the most relevant chunks, sends them to a local LLM, and streams the answer back
Nothing leaves your machine.
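The flow above can be sketched in a few lines of Python. This is an illustrative skeleton, not pedro's actual internals: `embed`, `search`, and `generate` stand in for the Ollama and ChromaDB calls and are injected so the control flow is visible.

```python
# Sketch of the single-step ask pipeline. The helper names are hypothetical;
# the real implementations live in retriever.py and llm.py.

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the context block the LLM sees: one excerpt per retrieved chunk."""
    context = "\n\n".join(
        f"[Book: {c['book']}, Page: {c['page']}]\n{c['text']}" for c in chunks
    )
    return f"Answer using only the excerpts below.\n\n{context}\n\nQuestion: {question}"

def ask(question, embed, search, generate, top_k=5):
    """embed/search/generate are injected so the flow is testable offline."""
    chunks = search(embed(question), top_k)          # vector DB lookup
    return generate(build_prompt(question, chunks))  # streamed in the real CLI
```

In the real pipeline `embed` calls the Ollama embeddings endpoint, `search` queries ChromaDB, and `generate` streams tokens from the chat model.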
Once you have played with parameters and models and `pedro ask` responds quickly,
you can start using `pedro research` for multi-step reasoning.
Quickstart (in an ideal world, with Python 3.13 and Ollama running on the local network):
# 1. Clone and install
git clone https://github.com/Nufeen/pdf-rag.git && cd pdf-rag
uv venv && source .venv/bin/activate && uv pip install -e .
# 2. Point to your Ollama host and configure
cp .env.example .env
# edit .env — set OLLAMA_BASE_URL=http://<your-ollama-host>:11434
# 3. Index your books
pedro index ~/Books/
# 4. Ask a question
pedro ask "What is backpropagation?"
# 5. Deep research (multi-step reasoning)
pedro research "Compare symbolic and connectionist approaches to AI"

That's it. Re-run `pedro index ~/Books/` whenever you add new PDFs — only new files are processed.
| Component | Choice | Reason |
|---|---|---|
| PDF extraction | PyMuPDF (fitz) | Fast, page-level metadata, handles most encodings |
| Chunking | Custom recursive splitter | Split on `\n\n` → `\n` → `.` to preserve semantics |
| Embeddings | nomic-embed-text via Ollama | See model recommendations below |
| Vector DB | ChromaDB (embedded/persistent) | No server, persists to disk, metadata filtering built-in |
| LLM | any via Ollama | See recommendations below |
| Ollama host | Remote (local network) | Set via OLLAMA_BASE_URL=http://<host-ip>:11434 |
| CLI | Click + Textual (see ADR 1 for details) | https://click.palletsprojects.com/en/stable/ |
| Framework | None (raw components) | RAG pipeline is simple; no LlamaIndex/LangChain overhead |
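The "custom recursive splitter" row can be illustrated with a short sketch. This is not the actual chunker.py code — it omits CHUNK_OVERLAP and other details — but it shows the idea of falling back from `\n\n` to `\n` to `. `:

```python
# Minimal recursive splitter sketch: try the largest separator first, fall back
# to smaller ones, and hard-cut only as a last resort.

def split_text(text: str, chunk_size: int = 800,
               separators: tuple[str, ...] = ("\n\n", "\n", ". ")) -> list[str]:
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        mid = text.rfind(sep, 0, chunk_size)  # last occurrence inside the window
        if mid > 0:
            cut = mid + len(sep)
            return (split_text(text[:cut], chunk_size, separators)
                    + split_text(text[cut:], chunk_size, separators))
    # No separator found inside the window: hard cut at chunk_size
    return [text[:chunk_size]] + split_text(text[chunk_size:], chunk_size, separators)
```

Splitting on the largest separator first keeps paragraphs and sentences intact whenever possible, which is what "preserve semantics" means here.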
See tests/README.md for setup and how to run tests.
pdf-rag/
├── pdf_rag/
│   ├── __init__.py
│   ├── cli.py          # Click entry point: `index` and `ask` commands
│   ├── config.py       # Constants + env var overrides
│   ├── chunker.py      # Recursive text splitter
│   ├── indexer.py      # PDF extraction, chunking, embedding, ChromaDB writes
│   ├── retriever.py    # Query embedding + ChromaDB search
│   └── llm.py          # Prompt construction + Ollama streaming chat
├── requirements.txt
├── pyproject.toml      # entry_points: `pdf-rag = pdf_rag.cli:cli`
└── README.md           # setup, usage, reindexing workflow, env vars
ChromaDB is stored at ~/.pdf-rag/chroma_db by default (overridable via --db-path or DB_PATH env var).
On the machine running Ollama:
# Allow network access (add to ~/.bashrc or systemd service)
export OLLAMA_HOST=0.0.0.0
ollama pull mxbai-embed-large
ollama pull command-r:35b   # or see model recommendations above

On the machine running the project, point to the remote Ollama host:

export OLLAMA_BASE_URL=http://192.168.1.X:11434   # replace with actual IP

| Model | Dims | Notes |
|---|---|---|
| mxbai-embed-large | 1024 | Recommended — better retrieval quality than nomic |
| nomic-embed-text | 768 | Default, fast, good baseline |
| bge-m3 | 1024 | Best quality, multilingual, slightly slower |
| Model | Size | Context | Notes |
|---|---|---|---|
| command-r:35b | 35B | 128k | Recommended — fine-tuned specifically for RAG, native citation support |
| mixtral:8x7b | 47B MoE | 32k | Excellent quality, fast due to MoE architecture |
| llama3.1:70b | 70B | 128k | Best reasoning if VRAM allows |
| qwen2.5:32b | 32B | 128k | Strong choice for non-English books |
| mistral:7b | 7B | 8k | Minimum viable, good for low-memory hosts |
command-r:35b is the best fit for RAG specifically — it's trained to ground answers in retrieved context
and produce accurate citations rather than hallucinate beyond the provided excerpts.
The recommended way to manage Python on macOS is uv — it handles Python installation, virtual environments, and dependencies in one tool.
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install Python 3.13 and pin it for this project
uv python install 3.13
uv python pin 3.13
# Create venv and install dependencies
uv venv
source .venv/bin/activate
uv pip install -e .

Then proceed with the Installation steps below.
Note
Don't forget to allow local network access for terminal sessions in macOS! If you get a "No route to host" error with a local-network Ollama, that is probably the cause.
uv venv
source .venv/bin/activate
uv pip install -e .

Copy and edit the env file:
cp .env.example .env
# edit .env — set OLLAMA_BASE_URL to your Ollama host IP

Scan a folder and index all PDFs:

pedro index ~/Books/

Output:
Embedding: deep_learning.pdf (1847 chunks)...
Indexed: deep_learning.pdf (1847 chunks)
Embedding: pattern_recognition.pdf (2103 chunks)...
Indexed: pattern_recognition.pdf (2103 chunks)
Indexing is incremental — each file's SHA-256 hash is stored. Re-running the command only processes new or changed files:
pedro index ~/Books/
# Skipping (already indexed): deep_learning.pdf
# Skipping (already indexed): pattern_recognition.pdf

Drop the new PDF into the folder and re-run the index command:
cp new_book.pdf ~/Books/
pedro index ~/Books/
# Skipping (already indexed): deep_learning.pdf
# Skipping (already indexed): pattern_recognition.pdf
# Embedding: new_book.pdf (1523 chunks)...
# Indexed: new_book.pdf (1523 chunks)

Only the new file is processed. Existing books are skipped instantly.
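The skip logic boils down to "hash the file, compare with what we saw last time". A sketch of the idea, assuming a plain dict as the hash store (the real indexer keeps hashes alongside the ChromaDB data):

```python
# Incremental-indexing sketch: SHA-256 each PDF and index only new/changed files.
import hashlib
from pathlib import Path

def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            h.update(block)
    return h.hexdigest()

def files_to_index(folder: Path, seen: dict[str, str]) -> list[Path]:
    """Return only PDFs that are new or whose content changed."""
    todo = []
    for pdf in sorted(folder.glob("*.pdf")):
        digest = file_sha256(pdf)
        if seen.get(pdf.name) != digest:  # unknown or changed hash
            todo.append(pdf)
            seen[pdf.name] = digest
    return todo
```

Hashing content (rather than checking mtime) means a replaced file with the same name is correctly re-indexed.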
Note
Watch your library size — ChromaDB has RAM limitations. If the index grows larger than the machine's RAM, it can trigger errors: chroma-core/chroma#1323. For now this works only with reasonably sized local libraries.
pedro ask "What is the vanishing gradient problem?"

Output:
Retrieved sources:
- deep_learning.pdf (page 289, score: 0.912)
- deep_learning.pdf (page 291, score: 0.887)
- pattern_recognition.pdf (page 144, score: 0.743)
Answer:
The vanishing gradient problem occurs when... [Book: deep_learning.pdf, Page: 289]
Residual connections solve this by... [Book: deep_learning.pdf, Page: 291]
Hide the source list:
pedro ask "Explain attention mechanisms" --no-sources

Retrieve more context chunks:
pedro ask "Compare LSTM and GRU" --top-k 8Override model or embedding per-run:
pedro ask "What is entropy?" --deep-model llama3.1:70b
pedro ask "What is entropy?" --embed-model bge-m3 --top-k 8

Available flags: --deep-model, --embed-model, --ollama-url, --top-k, --no-sources, --db-path
For complex questions that need multi-angle reasoning, use pedro research. It decomposes the question into sub-questions, answers each via RAG, then synthesizes and iteratively refines the result.
pedro research "What are the fundamental differences between symbolic and connectionist AI?"

Output:

Planning 3 sub-questions...
1. What is symbolic AI and what are its core assumptions?
2. What is connectionist AI and how do neural networks differ from symbolic systems?
3. What are the practical tradeoffs between the two approaches?
Executing 3 sub-question(s)...
[1/3] What is symbolic AI...
[2/3] What is connectionist AI...
[3/3] What are the practical tradeoffs...
Reflecting (pass 1/1)...
→ 1 follow-up sub-question(s) identified
Synthesizing final answer...
The fundamental differences between symbolic and connectionist AI...
Pipeline steps:
| Step | What happens | Model |
|---|---|---|
| Plan | Decomposes the question into N focused sub-questions | TINY_MODEL |
| Execute | For each sub-question: retrieves chunks from vector DB, generates a partial answer | FAST_MODEL |
| Reflect | Evaluates completeness; identifies gaps or follow-up questions. Repeats Execute if needed, up to --depth passes | TINY_MODEL |
| Synthesize | Combines all findings into a final answer with citations | DEEP_MODEL |
| Sources | Lists PDF files and page numbers from all retrieved chunks | — |
| Referenced in chunks | Extracts author names, paper/book titles, URLs mentioned inside the retrieved text | FAST_MODEL |
| Model's take | Brief perspective from the model's own training knowledge, independent of the PDFs | DEEP_MODEL |
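The plan → execute → reflect → synthesize steps above can be condensed into a control-flow sketch. The injected functions are stand-ins for the TINY/FAST/DEEP model calls, not pedro's real internals:

```python
# Research-loop sketch: each injected function represents one model role.

def research(question, plan, answer_one, reflect, synthesize, depth=2):
    findings = []
    queue = plan(question)  # TINY_MODEL: decompose into sub-questions
    for _ in range(depth):
        # FAST_MODEL: RAG answer per sub-question
        findings += [(q, answer_one(q)) for q in queue]
        # TINY_MODEL: gaps become follow-up sub-questions for the next pass
        queue = reflect(question, findings)
        if not queue:  # nothing missing: stop before exhausting depth
            break
    return synthesize(question, findings)  # DEEP_MODEL: final cited answer
```

Note that `--depth` bounds the number of reflect/execute passes; the loop exits early when reflection finds no gaps.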
Control depth and breadth:
pedro research "Explain attention mechanisms" --depth 1 # single pass, no reflection
pedro research "Compare LSTM, GRU and Transformer" --depth 3 --sub-questions 5

Override models per-run:

pedro research "..." --deep-model llama3.1:70b --fast-model mistral:7b --tiny-model qwen2.5:3b

Available flags: --deep-model, --fast-model, --tiny-model, --embed-model, --ollama-url, --depth, --sub-questions, --top-k, --languages, --translate-model, --db-path
Configure via .env:
RESEARCH_DEPTH=2
RESEARCH_N_SUBQUESTIONS=3
If your PDF collection contains books in multiple languages, retrieval quality drops when the query language differs from the document language. There are two ways to handle this.
Switch to an embedding model that maps all languages into the same vector space. No translation step, no extra latency per query — but requires a full re-index.
# Pull a multilingual embedding model
ollama pull bge-m3 # already listed in the embedding table above
# Set it in .env
EMBED_MODEL=bge-m3
# Re-index everything
pedro index ~/Books/ --force

bge-m3 handles 100+ languages and is the cleanest long-term solution.
If you want to keep the existing index, enable query translation. For each sub-question in pedro research, pedro will translate the query into each configured language and merge the results before generating an answer.
# In .env
SEARCH_LANGUAGES=Russian,French
TRANSLATE_MODEL=qwen2.5:3b   # any small model works; defaults to TINY_MODEL

Or pass per-run via CLI:
pedro research "What is entropy?" --languages Russian,French
pedro research "What is entropy?" --languages Russian --translate-model qwen2.5:3b

During research, translated queries are shown inline in the log:

Executing 3 sub-question(s)...
  [1/3] What is entropy?
        (→ Russian: Что такое энтропия?)
        (→ French: Qu'est-ce que l'entropie ?)
Notes:
- Translation only runs in `pedro research`, not in `pedro ask`
- Each language adds one extra embedding + retrieval call per sub-question
- The translation model needs to be pulled: `ollama pull qwen2.5:3b`
- Chunks retrieved across languages are deduplicated before answer generation
| | Multilingual embeddings | Query translation |
|---|---|---|
| Re-index required | Yes | No |
| Extra latency per query | None | 1 LLM call × N languages per sub-question |
| Works in pedro ask | Yes | No |
| Works in pedro research | Yes | Yes |
If you are starting fresh or can afford a re-index, use bge-m3. If you have an existing index and want to extend coverage without re-indexing, use SEARCH_LANGUAGES.
Pedro can run as an HTTP server, exposing the ask and research pipelines as streaming endpoints. This lets the TUI, a web frontend, or any other client connect to a single running instance.
pedro serve # binds to 127.0.0.1:8000
pedro serve --host 0.0.0.0 --port 9000

| Method | Path | Description |
|---|---|---|
| POST | /v1/ask | Single-step RAG answer, streamed as SSE |
| POST | /v1/research | Multi-step deep research, streamed as SSE |
All request fields are optional — the server falls back to its config defaults.
# Smoke test
curl -N -X POST http://localhost:8000/v1/ask \
-H "Content-Type: application/json" \
-d '{"question": "What is entropy?"}'

Each response is a stream of Server-Sent Events with three event types:

event: log
data: {"text": "Planning sub-questions..."}

event: token
data: {"text": "The answer is"}

event: done
data: {}
log events carry pipeline status messages (same as what you see in the TUI).
token events carry individual answer tokens.
done signals end of stream.
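A minimal client for this stream might parse events like the sketch below. A real client would read the lines from the HTTP response (e.g. with `requests`' `iter_lines`); the parsing logic is the point here:

```python
# SSE-parsing sketch: collect answer tokens and log messages from raw lines.
import json

def parse_sse(lines):
    tokens, logs, event = [], [], None
    for line in lines:
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            payload = json.loads(line.split(":", 1)[1].strip())
            if event == "token":
                tokens.append(payload["text"])
            elif event == "log":
                logs.append(payload["text"])
            elif event == "done":
                break
    return "".join(tokens), logs
```

Concatenating the `token` payloads in order reconstructs the full answer.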
Set PEDRO_SERVER_URL and the TUI will connect to the server instead of running the pipeline in-process:
# Terminal 1 β server
pedro serve
# Terminal 2 β TUI client
PEDRO_SERVER_URL=http://localhost:8000 pedro

Without PEDRO_SERVER_URL the TUI continues to work standalone (no server needed).
| Variable | Default | Description |
|---|---|---|
| PEDRO_SERVER_URL | `` (standalone) | If set, TUI connects to this server instead of running locally |
Use --force to re-index all files, for example after changing chunk size:
pedro index ~/Books/ --force

| Variable | Default | Description |
|---|---|---|
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama host URL |
| DB_PATH | ~/.pdf-rag/chroma_db | ChromaDB storage path |
| EMBED_MODEL | nomic-embed-text | Ollama embedding model (recommend mxbai-embed-large) |
| DEEP_MODEL | mistral:7b | Quality model — ask and final research synthesis (recommend command-r:35b) |
| FAST_MODEL | DEEP_MODEL | Medium model — per-sub-question answers and intermediate synthesis |
| TINY_MODEL | FAST_MODEL | Fast model — planning and reflection (3B recommended, e.g. qwen2.5:3b) |
| CHUNK_SIZE | 800 | Characters per chunk |
| CHUNK_OVERLAP | 150 | Overlap between chunks |
| TOP_K | 5 | Chunks retrieved per query |
| RESEARCH_DEPTH | 2 | Max reflection iterations for pedro research |
| RESEARCH_N_SUBQUESTIONS | 3 | Sub-questions per iteration for pedro research |
| SEARCH_LANGUAGES | `` (disabled) | Comma-separated languages for query translation in pedro research (e.g. Russian,French) |
| TRANSLATE_MODEL | TINY_MODEL | Model used to translate sub-questions when SEARCH_LANGUAGES is set |
| PEDRO_SERVER_URL | `` (standalone) | If set, TUI connects to this server instead of running the pipeline in-process |
All variables can also be passed as CLI flags — run pedro index --help, pedro ask --help, or pedro research --help for details.
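The defaults in the table suggest the usual "env var with fallback" pattern in config.py. A sketch of that pattern (illustrative, not the actual file), including the cascading model fallbacks:

```python
# Env-override sketch: each constant reads its env var, else the default.
import os

def env_int(name: str, default: int) -> int:
    return int(os.getenv(name, default))

CHUNK_SIZE = env_int("CHUNK_SIZE", 800)
CHUNK_OVERLAP = env_int("CHUNK_OVERLAP", 150)
TOP_K = env_int("TOP_K", 5)
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")

# Cascading defaults: FAST_MODEL falls back to DEEP_MODEL, TINY to FAST,
# so setting only DEEP_MODEL configures all three roles at once.
DEEP_MODEL = os.getenv("DEEP_MODEL", "mistral:7b")
FAST_MODEL = os.getenv("FAST_MODEL", DEEP_MODEL)
TINY_MODEL = os.getenv("TINY_MODEL", FAST_MODEL)
```

The cascade explains the FAST_MODEL/TINY_MODEL defaults in the table: unset, they inherit from the tier above.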
| Size | Tokens | Best for |
|---|---|---|
| Small | 128β256 | Specific, fact-based questions (FAQ, short answer) |
| Medium | 256β512 | Semantic search, general documentation, RAG chatbot |
| Large | 512β1024 | Summarizing, relationships in content, long-document analysis |
All prompts live in the prompts/ folder. Edit any file directly — changes take effect on the next command, no reinstall needed.
| File | Used by | Model | Placeholders | Purpose |
|---|---|---|---|---|
| prompts/answer.txt | pedro ask | DEEP_MODEL | {question}, {context} | System prompt for answer generation — controls tone, citation format, grounding rules |
| prompts/plan_subquestions.txt | pedro research | TINY_MODEL | {question}, {n} | Instructs the model to decompose the question into N sub-questions |
| prompts/reflect.txt | pedro research | TINY_MODEL | {question}, {answer} | Asks the model to evaluate completeness and identify gaps in the current answer |
| prompts/synthesize.txt | pedro research | DEEP_MODEL | {question}, {context} | Instructs the model to combine all research findings into a final answer |
| prompts/extract_citations.txt | pedro research | FAST_MODEL | {context} | Extracts cited authors, papers, books, and URLs from retrieved chunks |
| prompts/own_take.txt | TUI research mode | DEEP_MODEL | {question} | Asks the model for a brief perspective from its own training knowledge |
| prompts/translate_question.txt | pedro research | TINY_MODEL | {text}, {lang} | Translates a sub-question into a target language (used when SEARCH_LANGUAGES is set) |
Prompt files support {placeholders} filled at runtime. Do not remove placeholders — the tool will fail if they are missing.
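Placeholder filling of this kind is typically plain `str.format`. A sketch of such a loader (a hypothetical helper, not necessarily pedro's exact code), which also shows why a stray `{` or `}` added to a prompt file breaks rendering:

```python
# Prompt-rendering sketch: read the template file and fill its {placeholders}.
from pathlib import Path

def render_prompt(path: Path, **values) -> str:
    # str.format raises KeyError for a placeholder with no matching value,
    # and ValueError for unbalanced braces in the template.
    return path.read_text(encoding="utf-8").format(**values)
```

Extra keyword arguments are ignored by `format`, so passing more values than a template uses is harmless; the failure mode is a placeholder the code does not supply.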
Tools and projects worth looking at in the context of this problem:
I tried several, but surprisingly ended up writing my own code, since in every case it turned out to be not that easy to get a fully local stack working out of the box.