Dip11/insightdocs

---
title: InsightDocs
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.58.0
python_version: 3.11
app_file: app.py
pinned: false
license: mit
---

InsightDocs

A small RAG app for asking questions about Markdown documentation. Upload your docs, ask in plain English, get back answers with the source chunks the model actually used.

I built this to learn how real RAG systems are put together, end to end: chunking, hybrid retrieval, reranking, and evaluation. That's the plumbing that decides whether the thing actually works, as opposed to the embed-and-retrieve toy version.

Live demo: [your HF Spaces URL here]

What it does

  1. Loads your Markdown files
  2. Splits them along header boundaries, keeping section hierarchy as metadata
  3. Indexes the chunks with both a vector store (Chroma) and BM25
  4. For each question, runs hybrid retrieval, fuses the results, and reranks them with a cross-encoder
  5. Sends the top chunks to an LLM with instructions to cite every claim

Every answer includes inline [1], [2] citations and an expandable Sources panel showing the exact text used. If the retrieved context doesn't contain the answer, the model is told to refuse rather than guess.
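A minimal sketch of how a cite-or-refuse prompt like this can be assembled. The function name, the chunk dictionary keys (`section`, `text`), the refusal string, and the prompt wording are all illustrative assumptions, not the app's actual prompt:

```python
# Hypothetical refusal text; the real app's wording may differ.
REFUSAL = "I can't answer that from the provided documents."


def build_prompt(question, chunks):
    """Number each chunk so the model can cite it inline as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] ({chunk['section']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered sources below.\n"
        "Cite every claim inline with its source number, like [1].\n"
        f"If the sources do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because the numbers in the prompt match the order of `chunks`, the same list can back the Sources panel: citation [2] in the answer maps to `chunks[1]`.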

Tech choices and why

Embeddings through OpenRouter. I call nvidia/llama-nemotron-embed-vl-1b-v2:free for the vectors (2048 dimensions). It's on the free tier, so there's no bill, but it does mean both indexing and every query hit the network. The only model that runs locally is the reranker. I went this way to keep the install lighter; the cost is that you need a connection and the free embedding endpoint has its own rate limits.

OpenRouter for the chat model too. Currently openrouter/free, which routes to whatever free chat model is up. I use the openrouter Python client, so changing models is a one-line edit in config.py. The free tier sometimes hands back an empty completion when it's busy; the app catches that and shows a refusal instead of crashing.
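The empty-completion guard can be as small as the sketch below. It assumes the completion text has already been pulled out of the API response; the function name and refusal message are hypothetical:

```python
# Hypothetical message shown to the user when the model returns nothing usable.
EMPTY_COMPLETION_REFUSAL = "The model did not return an answer. Please try again."


def completion_or_refusal(text):
    """Return the completion text, or a refusal if it is missing or blank."""
    if text is None or not text.strip():
        return EMPTY_COMPLETION_REFUSAL
    return text.strip()
```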

Two-pass chunking. First pass splits on Markdown headers and tracks the hierarchy (Refunds > Refund window) as metadata. Second pass falls back to a recursive character splitter when a section is too big. Tokens are counted with tiktoken, so chunk sizes are measured in actual tokens rather than estimated from word counts.
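The two passes can be sketched in plain Python as follows. This is an illustration under assumptions, not the code in chunking.py: `count_tokens` is a whitespace stand-in for the tiktoken count, the separator order and the 200-token default are made up, and the header regex does not special-case `#` lines inside code fences.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def count_tokens(text):
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def split_by_headers(markdown):
    """First pass: return (hierarchy, body) pairs, one per header section."""
    sections, stack, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append((" > ".join(title for _, title in stack), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section under the old hierarchy
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()  # drop siblings and deeper headers
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


def split_oversized(text, max_tokens, separators=("\n\n", "\n", " ")):
    """Second pass: recursively split on coarser, then finer, separators."""
    if count_tokens(text) <= max_tokens or not separators:
        return [text]  # fits, or nothing left to split on
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if count_tokens(candidate) <= max_tokens:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = ""
        if count_tokens(piece) <= max_tokens:
            current = piece
        else:
            chunks.extend(split_oversized(piece, max_tokens, finer))
    if current:
        chunks.append(current)
    return chunks


def chunk_markdown(markdown, max_tokens=200):
    return [
        {"section": hierarchy, "text": part}
        for hierarchy, body in split_by_headers(markdown)
        for part in split_oversized(body, max_tokens)
    ]
```

Every chunk produced by the second pass keeps the hierarchy string of the section it came from, so a fragment from deep inside a long section still carries its full header path.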

Hybrid retrieval with RRF. Dense retrieval handles semantic matches. BM25 catches the exact terms an embedding model tends to smooth over, like function names and error codes. Reciprocal Rank Fusion merges the two ranked lists without having to normalize their score scales.
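For reference, Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every list it appears in. A minimal sketch, assuming each retriever returns an ordered list of chunk IDs; the constant k = 60 is the commonly used default and an assumption here, not a value taken from this repo:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # e.g. from the vector store
sparse_hits = ["b", "d", "a"]  # e.g. from BM25
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))  # ['b', 'a', 'd', 'c']
```

Only ranks enter the formula, which is why the raw dense distances and BM25 scores never need to be put on a common scale.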

Cross-encoder reranker. BAAI/bge-reranker-base rescores the top 10 to 20 candidates from hybrid retrieval. It's slower than the embedding-only path, but it reads each query-chunk pair together, which catches relevance problems a bi-encoder can't see. You can turn it off per query in the sidebar.
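The reranking step boils down to scoring (query, chunk) pairs and sorting. The sketch below takes the scorer as a parameter so it runs without downloading a model; the function names, `top_k` default, and toy scorer are illustrative. Loading the model through sentence-transformers, as in the comment, is an assumption about how it could be wired up, not a statement about retrieval.py:

```python
def rerank(query, chunks, score_pairs, top_k=5):
    """Rescore candidates with a pairwise scorer and keep the best top_k."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)  # one relevance score per pair
    ranked = sorted(zip(chunks, scores), key=lambda item: item[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def word_overlap_scorer(pairs):
    # Toy stand-in for a cross-encoder: counts shared lowercase words.
    return [
        len(set(query.lower().split()) & set(chunk.lower().split()))
        for query, chunk in pairs
    ]


# With a real model, one option (assumed, requires sentence-transformers):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("BAAI/bge-reranker-base")
#   rerank(query, chunks, model.predict)
candidates = ["card fees in the US", "the refund window is 30 days", "refund policy"]
print(rerank("refund window days", candidates, word_overlap_scorer, top_k=2))
```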

Running locally

git clone <this repo>
cd insightdocs
python3 -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Create a .env with your OpenRouter key:

OPENROUTER_API_KEY=sk-or-v1-...

Then run the web UI:

streamlit run app.py

If streamlit resolves to a different Python (Anaconda, for example), run it through the venv directly:

.venv/bin/python -m streamlit run app.py

There's also a terminal mode for poking at retrieval without the UI:

python -m rag_pipeline.main_hybrid          # full hybrid + rerank
python -m rag_pipeline.main_hybrid eval     # four-config ablation
python -m rag_pipeline.main_basic           # dense only

The first reranked query downloads the reranker model (about 280 MB) into a local cache. Later queries reuse the cached copy and skip the download.

Evaluation

The eval set is 15 questions over the bundled Stripe docs: 12 in-scope, 3 out-of-scope. It runs the same questions under four retrieval setups (dense only, sparse only, hybrid, hybrid + reranker) so you can see what each part contributes.

On this corpus the full pipeline passes all 15. That number is more flattering than useful, because the corpus is tiny and the three documents barely share vocabulary. The eval set is really a regression check while iterating, not a quality claim at scale.
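The shape of a four-config ablation loop, as a rough sketch. Everything here is hypothetical scaffolding rather than the contents of evaluation.py: the retrievers, the answer function, and the pass criterion are supplied by the caller, and the item format (a dict with a `question` key) is assumed.

```python
def run_ablation(eval_items, retrievers, answer_fn, passes):
    """Run every eval item under every retrieval setup.

    retrievers: dict of setup name -> function(question) -> list of chunks
    answer_fn:  function(question, chunks) -> answer string
    passes:     function(eval_item, answer) -> bool
    Returns a dict of setup name -> (passed, total).
    """
    results = {}
    for name, retrieve in retrievers.items():
        passed = 0
        for item in eval_items:
            chunks = retrieve(item["question"])
            answer = answer_fn(item["question"], chunks)
            if passes(item, answer):
                passed += 1
        results[name] = (passed, len(eval_items))
    return results
```

Keeping the questions and the pass criterion fixed while swapping only the retriever is what makes the per-setup numbers comparable.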

The per-question diagnostics were the interesting part:

The hardest case was an out-of-scope question about cryptocurrency fees. Dense retrieval confidently returned chunks about US card fees: right topic, wrong content. The refusal worked because of the prompt, not the retrieval. Hybrid + reranker held up better here because BM25 returned nothing (no crypto terms anywhere in the corpus), which thinned out the context and nudged the model toward refusing.

A vocabulary-mismatch question ("how do I stop my server from double-processing the same event?") found the right chunk about idempotency with no word overlap at all, but the dense match was much weaker than for direct-vocab questions. At larger scale, that's the first kind of question that starts to fail.

Limitations

  • The bundled corpus is three Markdown files. I haven't tested it on real document volumes.
  • The cross-encoder runs on CPU, so full-pipeline queries take 5 to 10 seconds.
  • The Hugging Face free Space sleeps after inactivity; the first request after a sleep takes about 30 seconds to warm up.
  • Free-tier OpenRouter rate limits apply to both embeddings and chat.

Project structure

.
├── app.py                   # Streamlit UI
├── rag_pipeline/            # the RAG engine, as a package
│   ├── loader.py            # read Markdown from a folder or uploads
│   ├── chunking.py          # header-aware, token-bounded splitting
│   ├── embeddings.py        # OpenRouter embedding calls
│   ├── indexer.py           # build the Chroma + BM25 indices
│   ├── retrieval.py         # dense, sparse, RRF, rerank
│   ├── generation.py        # prompt the LLM, format the source list
│   ├── evaluation.py        # eval + ablation harness
│   ├── config.py            # client, model names, paths
│   ├── main_basic.py        # terminal mode, dense only
│   └── main_hybrid.py       # terminal mode, full hybrid
├── documents/               # sample Stripe docs (payments, refunds, webhooks)
├── .streamlit/config.toml   # turns off the file watcher (avoids a torchvision import clash)
├── requirements.txt
└── README.md

Deploying to Hugging Face Spaces

  1. Create a Streamlit Space and push this repo to it.
  2. Add OPENROUTER_API_KEY under Settings > Secrets (the .env file is gitignored, so it won't be uploaded).
  3. The frontmatter at the top of this file configures the Space (SDK, versions, entry file).
