---
title: InsightDocs
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.58.0
python_version: 3.11
app_file: app.py
pinned: false
license: mit
---

InsightDocs

A small RAG app for asking questions about Markdown documentation. Upload your docs, ask in plain English, get back answers with the source chunks the model actually used.

I built this to learn how real RAG systems are put together, end to end: chunking, hybrid retrieval, reranking, and evaluation. That's the plumbing that decides whether the thing actually works, as opposed to the embed-and-retrieve toy version.

Live demo: [your HF Spaces URL here]

What it does

Loads your Markdown files
Splits them along header boundaries, keeping section hierarchy as metadata
Indexes the chunks with both a vector store (Chroma) and BM25
For each question, runs hybrid retrieval, fuses the results, and reranks them with a cross-encoder
Sends the top chunks to an LLM with instructions to cite every claim

Every answer includes inline [1], [2] citations and an expandable Sources panel showing the exact text used. If the retrieved context doesn't contain the answer, the model is told to refuse rather than guess.
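A minimal sketch of how numbered context and citation checking could fit together. The function names are hypothetical and this is not the code in generation.py; it only illustrates the idea of numbering chunks so that [1], [2] in the answer can be traced back to a source.

```python
import re


def build_context(chunks):
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    return "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))


def cited_indices(answer):
    """Return the distinct citation numbers found in an answer, ascending."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})


def invalid_citations(answer, num_chunks):
    """Citation numbers that do not point at any chunk that was supplied."""
    return [n for n in cited_indices(answer) if not 1 <= n <= num_chunks]
```

A check like invalid_citations is one way to flag an answer that cites a source number the model was never shown.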

Tech choices and why

Embeddings through OpenRouter. I call nvidia/llama-nemotron-embed-vl-1b-v2:free for the vectors (2048 dimensions). It's on the free tier, so there's no bill, but because the model is remote, both indexing and every query hit the network. The only model that runs locally is the reranker. I went this way to keep the install lighter; the cost is that you need a connection and the free embedding endpoint has its own rate limits.

OpenRouter for the chat model too. Currently openrouter/free, which routes to whatever free chat model is up. I use the openrouter Python client, so changing models is a one-line edit in config.py. The free tier sometimes hands back an empty completion when it's busy; the app catches that and shows a refusal instead of crashing.
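The empty-completion guard can be sketched in a few lines. The function name and the refusal wording here are assumptions for illustration, not the app's actual message.

```python
# Hypothetical refusal text; the real app's wording may differ.
REFUSAL = "The model returned no answer. Please try again in a moment."


def answer_or_refusal(completion_text):
    """Return the model's text, or a refusal when the completion is empty."""
    if completion_text is None or not completion_text.strip():
        return REFUSAL
    return completion_text.strip()
```

Treating None and whitespace-only strings the same way keeps the UI code from having to care which form "empty" took.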

Two-pass chunking. First pass splits on Markdown headers and tracks the hierarchy (Refunds > Refund window) as metadata. Second pass falls back to a recursive character splitter when a section is too big. Tokens are counted with tiktoken, so chunk sizes are real rather than guessed from word counts.
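The two passes can be sketched with the standard library alone. This is a simplified illustration, not the code in chunking.py: the function names are hypothetical, the second pass packs paragraphs greedily instead of using a true recursive character splitter, and the default token counter is a whitespace word count standing in for tiktoken.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown):
    """First pass: return (breadcrumb, body) for each header-delimited section.

    Simplification: a '#' line inside a fenced code block is treated as a header.
    """
    stack, body, sections = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(title for _, title in stack), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section under the old breadcrumb
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before pushing this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


def split_oversized(text, max_tokens, count_tokens):
    """Second pass: pack paragraphs into pieces of at most max_tokens.

    A single paragraph larger than max_tokens is kept whole in this sketch.
    """
    pieces, current = [], []
    for para in text.split("\n\n"):
        candidate = "\n\n".join(current + [para])
        if current and count_tokens(candidate) > max_tokens:
            pieces.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        pieces.append("\n\n".join(current))
    return pieces


def chunk(markdown, max_tokens=200, count_tokens=lambda s: len(s.split())):
    """Run both passes and attach the section breadcrumb as metadata."""
    return [
        {"section": breadcrumb, "text": piece}
        for breadcrumb, body in split_by_headers(markdown)
        for piece in split_oversized(body, max_tokens, count_tokens)
    ]
```

To count real tokens, pass something like count_tokens=lambda s: len(enc.encode(s)) with enc = tiktoken.get_encoding("cl100k_base"); the encoding name is an assumption, since the README doesn't say which one the app uses.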

Hybrid retrieval with RRF. Dense retrieval handles semantic matches. BM25 catches the exact terms an embedding model tends to smooth over, like function names and error codes. Reciprocal Rank Fusion merges the two ranked lists without having to normalize their score scales.
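As a sketch of the fusion step: each document's fused score is the sum of 1 / (k + rank) over the lists it appears in, so only ranks matter and raw scores never need to be compared. The function name is hypothetical, and k=60 is the commonly used default, assumed here because the README doesn't state the value the app uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids into one list, best first.

    rankings: iterable of lists, each ordered from most to least relevant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["a", "b", "c"]   # ids ranked by vector similarity
sparse = ["c", "a", "d"]  # ids ranked by BM25
fused = reciprocal_rank_fusion([dense, sparse])
```

A document that appears in both lists (a and c here) outranks one that appears in only one, even if that one ranked highly in its own list.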

Cross-encoder reranker. BAAI/bge-reranker-base rescores the top 10 to 20 candidates from hybrid retrieval. It's slower than the embedding-only path, but it reads each query-chunk pair together, which catches relevance problems a bi-encoder can't see. You can turn it off per query in the sidebar.
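A rough sketch of the rerank step, with the scorer passed in so the logic can be tried without downloading a model. The function names are hypothetical, and loading the model through sentence-transformers is an assumption; the README names the model but not the library.

```python
def rerank(query, chunks, score_pairs, top_k=5):
    """Score each (query, chunk) pair jointly and keep the top_k chunks."""
    if not chunks:
        return []
    scores = score_pairs([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def load_bge_scorer():
    """Return a scorer backed by BAAI/bge-reranker-base.

    Assumes the sentence-transformers package; the model is downloaded
    on first use.
    """
    from sentence_transformers import CrossEncoder  # lazy: heavy import

    return CrossEncoder("BAAI/bge-reranker-base").predict
```

With the real model this would be called as rerank(question, candidates, load_bge_scorer()). Skipping the call entirely is the equivalent of the sidebar toggle.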

Running locally

git clone <this repo>
cd RagPipeline
python3 -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Create a .env with your OpenRouter key:

OPENROUTER_API_KEY=sk-or-v1-...

Then run the web UI:

streamlit run app.py

If streamlit resolves to a different Python (Anaconda, for example), run it through the venv directly:

.venv/bin/python -m streamlit run app.py

There's also a terminal mode for poking at retrieval without the UI:

python -m rag_pipeline.main_hybrid          # full hybrid + rerank
python -m rag_pipeline.main_hybrid eval     # four-config ablation
python -m rag_pipeline.main_basic           # dense only

The first reranked query downloads the reranker model (about 280 MB) into a local cache. Later queries load it from the cache, so only the first one pays for the download.

Evaluation

The eval set is 15 questions over the bundled Stripe docs: 12 in-scope, 3 out-of-scope. It runs the same questions under four retrieval setups (dense only, sparse only, hybrid, hybrid + reranker) so you can see what each part contributes.
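The shape of the ablation loop, as a sketch. Everything here is hypothetical: the flag names, the answer_fn and passes_fn callables, and the pass criterion are placeholders, not the interface of evaluation.py.

```python
# The four retrieval setups, expressed as on/off flags (names are assumed).
CONFIGS = {
    "dense only": {"dense": True, "sparse": False, "rerank": False},
    "sparse only": {"dense": False, "sparse": True, "rerank": False},
    "hybrid": {"dense": True, "sparse": True, "rerank": False},
    "hybrid + reranker": {"dense": True, "sparse": True, "rerank": True},
}


def run_ablation(questions, answer_fn, passes_fn):
    """Run every question under every config; return (passed, total) per config.

    answer_fn(question_text, dense=..., sparse=..., rerank=...) produces an answer.
    passes_fn(question, answer) decides whether that answer counts as a pass.
    """
    results = {}
    for name, flags in CONFIGS.items():
        passed = sum(
            1 for q in questions if passes_fn(q, answer_fn(q["question"], **flags))
        )
        results[name] = (passed, len(questions))
    return results
```

Holding the question set fixed and changing only the flags is what makes the per-config numbers comparable.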

On this corpus the full pipeline passes all 15. That number is more flattering than useful, because the corpus is tiny and the three documents barely share vocabulary. The eval set is really a regression check while iterating, not a quality claim at scale.

The per-question diagnostics were the interesting part:

The hardest case was an out-of-scope question about cryptocurrency fees. Dense retrieval confidently returned chunks about US card fees: right topic, wrong content. The refusal worked because of the prompt, not the retrieval. Hybrid + reranker held up better here because BM25 returned nothing (no crypto terms anywhere in the corpus), which thinned out the context and nudged the model toward refusing.

A vocabulary-mismatch question ("how do I stop my server from double-processing the same event?") found the right chunk about idempotency with no word overlap at all, but the dense distance was much weaker than for direct-vocab questions. At larger scale, that's the first kind of question that starts to fail.

Limitations

The bundled corpus is three Markdown files. I haven't tested the pipeline on real document volumes.
The cross-encoder runs on CPU, so full-pipeline queries take 5 to 10 seconds.
The Hugging Face free Space sleeps after inactivity; the first request after a sleep takes about 30 seconds to warm up.
Free-tier OpenRouter rate limits apply to both embeddings and chat.

Project structure

.
├── app.py                   # Streamlit UI
├── rag_pipeline/            # the RAG engine, as a package
│   ├── loader.py            # read Markdown from a folder or uploads
│   ├── chunking.py          # header-aware, token-bounded splitting
│   ├── embeddings.py        # OpenRouter embedding calls
│   ├── indexer.py           # build the Chroma + BM25 indices
│   ├── retrieval.py         # dense, sparse, RRF, rerank
│   ├── generation.py        # prompt the LLM, format the source list
│   ├── evaluation.py        # eval + ablation harness
│   ├── config.py            # client, model names, paths
│   ├── main_basic.py        # terminal mode, dense only
│   └── main_hybrid.py       # terminal mode, full hybrid
├── documents/               # sample Stripe docs (payments, refunds, webhooks)
├── .streamlit/config.toml   # turns off the file watcher (avoids a torchvision import clash)
├── requirements.txt
└── README.md

Deploying to Hugging Face Spaces

Create a Streamlit Space and push this repo to it.
Add OPENROUTER_API_KEY under Settings > Secrets (the .env file is gitignored, so it won't be uploaded).
The frontmatter at the top of this file configures the Space (SDK, versions, entry file).
