A local Retrieval-Augmented Generation (RAG) app. Upload PDF or text documents, ask questions in natural language, and get answers generated by Claude that are grounded in your documents and backed by numbered citations pointing at the exact source and page.
Embeddings run locally (no embedding API calls); only answer generation uses the Anthropic Claude API.
+-------------------------------------------------+
|              app.py (Streamlit UI)              |
| sidebar: upload | doc list | clear KB | API key |
| main: query box | answer | cited sources        |
+--------+------------------------------+---------+
         |                              |
  ingest |                              | query
         v                              v
+------------------+           +------------------+
|    ingest.py     |           |   retriever.py   |
| load -> chunk -> |           | embed query ->   |
| embed -> upsert  |           | similarity search|
+--------+---------+           +--------+---------+
         |                              |
         v                              v
+-------------------------------------------------+
|    ChromaDB (persistent) + all-MiniLM-L6-v2     |
+-------------------------------------------------+
                         | top-k chunks
                         v
                +------------------+
                |      rag.py      |  build context + citations -> Claude -> answer
                +------------------+
                         ^
           config.py (shared constants)
| File | Responsibility |
|---|---|
| `config.py` | Single source of truth for all tunable constants. |
| `ingest.py` | Write path: load PDF/txt, chunk, embed, upsert into ChromaDB. |
| `retriever.py` | Read path: embed the query, run vector similarity search. |
| `rag.py` | LLM path: build the cited-context prompt and call the Claude API. |
| `app.py` | UI only: orchestrates the modules above; holds no business logic. |
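As an illustration of what a shared constants module could look like, here is a minimal sketch. The values are the ones stated in this README; the `Settings` class and every field name are hypothetical and may not match the real `config.py`.

```python
"""Hypothetical shape for a shared settings module (names are illustrative)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Values below are taken from this README; field names are assumptions.
    embedding_model: str = "all-MiniLM-L6-v2"  # local sentence-transformers model
    chunk_size_tokens: int = 256               # sized to the embedder's input limit
    chunk_overlap_tokens: int = 50             # tokens shared by adjacent chunks
    top_k: int = 5                             # chunks returned per query
    chroma_dir: str = "chroma_db"              # on-disk vector store location
    claude_model: str = "claude-sonnet-4-6"    # model used for answer generation

    def __post_init__(self) -> None:
        # Reject combinations that would break chunking or retrieval.
        if not 0 <= self.chunk_overlap_tokens < self.chunk_size_tokens:
            raise ValueError("chunk overlap must be >= 0 and smaller than chunk size")
        if self.top_k < 1:
            raise ValueError("top_k must be at least 1")


# A single immutable instance that the other modules could import.
SETTINGS = Settings()
```

Freezing the dataclass keeps the constants read-only at runtime, which fits the "single source of truth" role described in the table.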
- Ingest: Each document is split into overlapping, token-bounded chunks. Every chunk is embedded locally with `all-MiniLM-L6-v2` and stored in a persistent ChromaDB collection along with its `source` and `page` metadata.
- Retrieve: The question is embedded with the same model; ChromaDB returns the top-`k` (default 5) most similar chunks by cosine similarity.
- Generate: Those chunks are formatted into a numbered context block and sent to Claude with instructions to answer only from the context and to cite each claim with `[n]`. The UI shows the answer plus an expandable, numbered list of the exact source chunks (with page and similarity score).
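The retrieve and generate steps above can be sketched in plain Python. This is not the app's code: in the app, ChromaDB performs the similarity search and the embeddings come from `all-MiniLM-L6-v2`. The `Chunk` class, the function names, and the exact layout of the context block are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical record for a stored chunk and its metadata."""
    text: str
    source: str
    page: int
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_embedding: list[float], chunks: list[Chunk], k: int = 5):
    """Return up to k (score, chunk) pairs, most similar first."""
    scored = [(cosine_similarity(query_embedding, c.embedding), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_context(scored) -> str:
    """Format retrieved chunks as a numbered block so the model can cite [n]."""
    blocks = []
    for n, (_score, chunk) in enumerate(scored, start=1):
        blocks.append(f"[{n}] {chunk.source}, page {chunk.page}\n{chunk.text}")
    return "\n\n".join(blocks)
```

The number assigned in `build_context` is the same number the UI would use in its source list, which is what lets a reader trace each `[n]` in the answer back to a specific chunk.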
Requires Python 3.10+.
A setup script creates the virtual environment, installs dependencies, and seeds
a .env file for you:
# Windows (PowerShell)
.\setup.ps1
# if you get an execution-policy error, run instead:
# powershell -ExecutionPolicy Bypass -File setup.ps1
# macOS / Linux
bash setup.sh

To set up manually instead:

# 1. Create and activate a virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1 # Windows (PowerShell)
source .venv/bin/activate # macOS/Linux
# 2. Install dependencies
pip install -r requirements.txt
# 3. Seed your environment file
cp .env.example .env # Windows: copy .env.example .env

Either way, the API key is optional at setup time: add your Anthropic key to
.env (get one at https://console.anthropic.com/), or just leave it blank and
the app will prompt for it in the sidebar on first use.
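One way the "key from .env, else prompt in the sidebar" behavior could work is sketched below. It assumes the .env file has already been loaded into the process environment and that the variable is named `ANTHROPIC_API_KEY`; both the variable name and the `resolve_api_key` helper are assumptions, not confirmed details of the app.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    sidebar_value: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer the environment value, fall back to the sidebar input.

    Returns None when neither is set, which the UI could treat as
    "show the API key prompt".
    """
    env = os.environ if environ is None else environ
    # ANTHROPIC_API_KEY is an assumed variable name.
    for candidate in (env.get("ANTHROPIC_API_KEY"), sidebar_value):
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```

Passing `environ` explicitly keeps the helper easy to exercise without touching the real environment.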
streamlit run app.py

First run note: on the very first launch, sentence-transformers downloads the `all-MiniLM-L6-v2` embedding model (~90 MB). The app may appear to pause for a few seconds while this happens; the model is cached locally for every run afterward.
Then, in the browser:
- Upload one or more PDF / .txt files in the sidebar and click Ingest documents.
- Type a question in the main panel and click Ask.
- Read the grounded answer and expand the Sources to verify each citation.
Ingested documents persist on disk (in chroma_db/) between runs. Use
Clear knowledge base in the sidebar to wipe everything.
- Why 256-token chunks? `all-MiniLM-L6-v2` truncates its input at 256 tokens. Chunking any larger would mean the tail of every chunk never influences its own embedding, silently degrading retrieval. Chunks are therefore sized to the embedder's real limit (256 tokens, 50-token overlap), measured with the embedding model's own tokenizer so the units line up exactly.
- Real vector search, not keywords. Retrieval is cosine similarity over dense embeddings, so a question like "how did sales do?" can match a chunk that says "revenue increased 12%" even with no shared words.
- Local embeddings. All embedding happens on-device via sentence-transformers, so document text is only sent to a third party at answer-generation time.
- Grounded + cited. The system prompt forbids outside knowledge and requires `[n]` citations; if the retrieved context is insufficient, the model is told to say so rather than guess.
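The chunk sizing described in the first point can be illustrated with a sliding window over token IDs. This is a minimal sketch, not the code in `ingest.py`: it assumes the text has already been tokenized with the embedding model's tokenizer, and `chunk_tokens` is a hypothetical name. In practice the tokenizer may also reserve a couple of positions for special tokens, so the usable budget per chunk can be slightly below 256.

```python
def chunk_tokens(
    token_ids: list[int],
    chunk_size: int = 256,
    overlap: int = 50,
) -> list[list[int]]:
    """Split token IDs into windows of at most chunk_size tokens.

    Each window starts (chunk_size - overlap) tokens after the previous
    one, so adjacent windows share `overlap` tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[list[int]] = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        # Stop once a window has reached the end of the input,
        # otherwise a redundant tail chunk would be emitted.
        if start + chunk_size >= len(token_ids):
            break
    return chunks
```

With the defaults, a 600-token document yields three chunks starting at tokens 0, 206, and 412, and the last 50 tokens of each chunk reappear at the start of the next.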
- UI: Streamlit
- Embeddings: sentence-transformers (`all-MiniLM-L6-v2`)
- Vector store: ChromaDB (persistent, local)
- LLM: Anthropic Claude (`claude-sonnet-4-6`)
- PDF parsing: pypdf