Local retrieval-augmented generation over a folder of markdown files — in one command. Qdrant for vector storage, sentence-transformers for embeddings, Anthropic Claude for the answer.
The GIF above shows the autonomous part of `make demo`: Qdrant up, wait, ingest, verify. Below is what the full demo looks like once you add your `ANTHROPIC_API_KEY` to `.env`. See `guide.md` for the end-to-end walkthrough, including prerequisites, expected output, and common failure modes.
$ make demo
[ingests 12 markdown files into a local Qdrant]
[asks: "How do I lock down a namespace so pods can only talk to each other on explicit ports?"]
[Claude streams an answer grounded in your notes]
Sources:
- 10-network-policies.md > A default-deny baseline
- 10-network-policies.md > Allowing specific flows
- 06-services-and-endpoints.md > How selectors map to endpoints
- Markdown-aware chunking that respects heading boundaries and keeps fenced code blocks atomic.
- Vector search with Qdrant — idempotent ensure-collection, payload index for fast metadata filtering, batched upsert with stable UUIDs.
- Strict-grounding prompt — Claude answers only from retrieved context and says so when the answer isn't in the notes.
- Streaming answer + post-stream citation footer so the user always sees which files the answer came from.
- Production hygiene — friendly errors, exponential-backoff readiness probe, no tracebacks at the CLI surface, no secrets in git history.
RAG · Vector Databases (Qdrant) · Embeddings (sentence-transformers, BGE) · Python · Anthropic Claude API · Docker Compose · Typer CLI · GitHub Actions CI
You need Docker, Python ≥ 3.11, and an Anthropic API key.
git clone <this repo>
cd markdown-rag
cp .env.example .env
# put your Anthropic API key in .env, e.g.:
# ANTHROPIC_API_KEY=sk-ant-...
pip install -r requirements.txt
make demo

`make demo` brings up Qdrant in Docker, waits for it to be ready, ingests
the sample corpus (12 Kubernetes ops notes), and asks the demo question.
First run downloads the embedding model (~130 MB) into the Hugging Face
cache.
You can also run the steps individually:
make up # docker compose up -d (qdrant)
make wait # block until qdrant answers
make ingest # chunk + embed + upsert the corpus
make ask Q="What is the difference between a readiness and liveness probe?"
make reset # drop the collection
make down # docker compose down

┌────────────┐    ┌─────────────────┐    ┌──────────────────┐    ┌────────┐
│ corpus/*.md│───▶│ chunk on H2     │───▶│ embed (BGE-small)│───▶│ Qdrant │
└────────────┘    │ keep code       │    │ 384-dim, cosine  │    │        │
                  │ atomic          │    └──────────────────┘    └────────┘
                  └─────────────────┘                                 ▲
                                                                      │
┌────────────┐    ┌─────────────────┐    ┌──────────────────┐         │
│ question   │───▶│ embed query     │───▶│ top-k search     │─────────┘
└────────────┘    └─────────────────┘    └──────────────────┘
                                                   │
                                                   ▼
                                         ┌──────────────────┐
                                         │ Claude messages. │
                                         │   stream(...)    │
                                         │  + Sources footer│
                                         └──────────────────┘

Chunking (src/rag/chunking.py) walks the markdown-it-py token
stream, splits each file on H2 headings (or H1 if a file has no H2),
preserves the heading path as metadata, keeps fenced code blocks
atomic, and applies a soft size cap with one-sentence overlap on
size-driven splits.
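
The heading split and the "code blocks stay atomic" rule can be illustrated with a small standard-library sketch. This is not the code in src/rag/chunking.py: the project walks the markdown-it-py token stream, while this version scans lines with two regular expressions, and it leaves out the soft size cap and sentence overlap. The function name `split_on_headings` and the dictionary keys are assumptions made for the example.

```python
import re

# Simplified fence detection: any line that opens with 3+ backticks or tildes.
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_on_headings(text: str, level: int = 2) -> list[dict]:
    """Split markdown at headings of `level` (or shallower), never inside a fence."""
    chunks: list[dict] = []
    title, heading = "", ""   # shallower heading, and the heading that opened the chunk
    lines: list[str] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            path = " > ".join(p for p in (title, heading) if p)
            chunks.append({"heading_path": path, "text": body})
        lines.clear()

    for line in text.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence          # toggle on opening and closing fences
            match = None
        else:
            # Heading-like lines inside a fence are code, not structure.
            match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= level:
            flush()
            if len(match.group(1)) < level:
                title, heading = match.group(2), ""
            else:
                heading = match.group(2)
        lines.append(line)
    flush()
    return chunks


TICKS = "`" * 3  # built at runtime so the sample can contain a fenced block
sample = "\n".join([
    "# Network policies",
    "Intro.",
    "## A default-deny baseline",
    TICKS + "yaml",
    "## not a heading, just a YAML comment",
    "kind: NetworkPolicy",
    TICKS,
    "## Allowing specific flows",
    "Text.",
])
for chunk in split_on_headings(sample):
    print(chunk["heading_path"])
```

The sample yields three chunks, and the YAML comment that looks like an H2 stays inside the second chunk rather than starting a new one. For a file with no H2, calling the function with `level=1` mirrors the H1 fallback described above.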
Embeddings (src/rag/embeddings.py) use BAAI/bge-small-en-v1.5
— MIT-licensed, 384-dim, retrieval-trained. Qdrant's Distance.COSINE
normalizes server-side, so the client just hands over the raw vectors.
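
To see why handing over raw vectors is safe, it helps to recall that the cosine similarity of two raw vectors equals the dot product of their L2-normalized forms. The sketch below checks that identity with plain Python lists; the helper names are illustrative and are not part of src/rag/embeddings.py.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Normalizing first and taking a dot product gives the same score as cosine
# on the raw vectors, so it does not matter which side does the normalizing.
assert math.isclose(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
```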
Vector store (src/rag/vector_store.py) wraps qdrant-client's
sync API. ensure_collection is idempotent. A keyword payload index on
source_file is created at first ingest so filter-by-file queries are
fast. The demo doesn't filter by file, but the index is already in place
for when the corpus grows. The connection probe uses exponential
backoff so make demo survives Docker startup latency without a
hard-coded sleep.
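
A readiness probe with exponential backoff might look roughly like the sketch below. The function name, the default delays, and the injectable `probe` and `sleep` callables are assumptions for illustration; the actual probe in src/rag/vector_store.py may differ.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    *,
    attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or reset while the container is still starting.
            pass
        if attempt < attempts - 1:      # no point sleeping after the last try
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False
```

In practice `probe` would be a small function that tries to reach `QDRANT_URL` and returns True on success. Passing `sleep` as a parameter keeps the loop testable without real waiting.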
Stable chunk IDs are deterministic UUIDv5s of
(source_file, heading_path, chunk_index). To stay in sync with edits,
ingest deletes all points for each file before re-upserting it — so
adding or removing a section never leaves orphan vectors behind.
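
A deterministic chunk ID of this kind can be sketched with the standard `uuid` module. The namespace value is the `RAG_NAMESPACE` default from the configuration table; the way the three fields are joined into one name (a unit-separator character here) is an assumption, since the exact encoding is not described.

```python
import uuid

# Default RAG_NAMESPACE from the configuration table.
NAMESPACE = uuid.UUID("6d4e9a3a-3b1f-4f1b-8b9a-6c1d2c5e7f10")


def chunk_id(
    source_file: str,
    heading_path: str,
    chunk_index: int,
    namespace: uuid.UUID = NAMESPACE,
) -> str:
    # Assumed encoding: join the fields with a separator that is unlikely
    # to appear in file names or headings.
    name = "\x1f".join((source_file, heading_path, str(chunk_index)))
    return str(uuid.uuid5(namespace, name))


first = chunk_id("10-network-policies.md", "A default-deny baseline", 0)
again = chunk_id("10-network-policies.md", "A default-deny baseline", 0)
other = chunk_id("10-network-policies.md", "A default-deny baseline", 1)
assert first == again and first != other
```

Because the ID depends only on the three fields and the namespace, re-ingesting an unchanged file produces the same point IDs, which is what makes the upsert idempotent.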
Answer (src/rag/answer.py) builds a context block delimited by
distinctive <<<RAG_CONTEXT_BEGIN>>> / <<<RAG_CONTEXT_END>>> markers so
corpus content can't accidentally close the boundary. Each excerpt is
tagged [file > heading]. Claude receives a strict-grounding system
prompt, streams the response token-by-token, and we then print a deduped
Sources: footer.
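
The context block and the deduplicated footer can be sketched as two small pure functions. The markers and the `[file > heading]` tag come from the description above; the function names and the shape of the chunk dictionaries are assumptions for the example, and the Claude call itself is left out.

```python
CONTEXT_BEGIN = "<<<RAG_CONTEXT_BEGIN>>>"
CONTEXT_END = "<<<RAG_CONTEXT_END>>>"


def build_context(chunks: list[dict]) -> str:
    """Wrap retrieved excerpts, each tagged [file > heading], in the markers."""
    parts = [
        f"[{c['source_file']} > {c['heading_path']}]\n{c['text']}" for c in chunks
    ]
    return "\n".join([CONTEXT_BEGIN, "\n\n".join(parts), CONTEXT_END])


def sources_footer(chunks: list[dict]) -> str:
    """List each (file, heading) pair once, in retrieval order."""
    # dict.fromkeys removes duplicates while keeping first-seen order.
    seen = dict.fromkeys(f"{c['source_file']} > {c['heading_path']}" for c in chunks)
    return "\n".join(["Sources:", *(f"- {s}" for s in seen)])


retrieved = [
    {"source_file": "10-network-policies.md",
     "heading_path": "A default-deny baseline", "text": "Deny all ingress first."},
    {"source_file": "10-network-policies.md",
     "heading_path": "A default-deny baseline", "text": "Then add allow rules."},
    {"source_file": "06-services-and-endpoints.md",
     "heading_path": "How selectors map to endpoints", "text": "Selectors pick pods."},
]
print(build_context(retrieved))
print(sources_footer(retrieved))
```

Two excerpts from the same section produce a single footer entry, so the footer stays short even when several chunks of one section are retrieved.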
Point the CLI at your own folder of markdown:
python rag.py ingest --dir ~/notes/k8s
python rag.py ask "what does that thing do again" --top-k 8

Override defaults with env vars (see `src/rag/config.py`):
| Env var | Default | What it does |
|---|---|---|
| `ANTHROPIC_API_KEY` | (required) | Anthropic API key |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant endpoint |
| `RAG_COLLECTION` | `markdown-rag` | Qdrant collection name |
| `RAG_EMBED_MODEL` | `BAAI/bge-small-en-v1.5` | Sentence-transformers model |
| `RAG_ANTHROPIC_MODEL` | `claude-sonnet-4-6` | Claude model |
| `RAG_TOP_K` | `5` | Chunks retrieved per question |
| `RAG_BATCH_SIZE` | `32` | Embedding + upsert batch size |
| `RAG_MAX_TOKENS` | `1024` | Max tokens in Claude's reply |
| `RAG_CORPUS_DIR` | `<repo>/corpus` | Where ingest reads markdown from |
| `RAG_NAMESPACE` | `6d4e9a3a-3b1f-4f1b-8b9a-6c1d2c5e7f10` | UUIDv5 namespace for chunk IDs |
pip install -r requirements.txt
pip install pytest ruff
pytest -q
ruff check . && ruff format --check .

CI runs ruff + pytest on every PR (`.github/workflows/ci.yml`). The
smoke test mocks Qdrant and Anthropic, so CI needs no network and no
Docker.
MIT
