Archit-Konde/RAG

RAG Pipeline — From Scratch

Retrieval-Augmented Generation built entirely by hand. No LangChain. No LlamaIndex. Every algorithm implemented from first principles.


Architecture

                        ┌─────────────────────────────────────┐
                        │           INDEXING PHASE            │
                        └─────────────────────────────────────┘

  PDF / TXT
      │
      ▼
 ┌──────────┐    raw text    ┌───────────┐    chunks     ┌─────────────┐
 │ingestion │ ─────────────► │  chunker  │ ─────────────► │  embeddings │
 └──────────┘                └───────────┘                └─────────────┘
   PyPDF2                  recursive split                 all-MiniLM-L6
                           chunk_size=512                  mean pooling
                           overlap=64                      L2 normalize
                                │                               │
                                │         chunks + embeddings   │
                                ▼                               ▼
                           ┌─────────┐                  ┌────────────┐
                           │  BM25   │                  │VectorStore │
                           └─────────┘                  └────────────┘
                           TF-IDF math                  NumPy arrays
                           fit(corpus)                  cosine search


                        ┌─────────────────────────────────────┐
                        │            QUERY PHASE              │
                        └─────────────────────────────────────┘

  User Query
      │
      ├─────────────────────────────────────────────────┐
      │                                                 │
      ▼                                                 ▼
 ┌──────────┐  query vec   ┌────────────┐         ┌─────────┐
 │embeddings│ ────────────►│VectorStore │         │  BM25   │
 └──────────┘              │  .search() │         │get_top_n│
                           └────────────┘         └─────────┘
                                │                      │
                          dense results          sparse results
                                │                      │
                                └──────────┬───────────┘
                                           │
                                           ▼
                                    ┌────────────┐
                                    │  retriever │
                                    │    RRF     │
                                    └────────────┘
                                    Reciprocal Rank
                                    Fusion (k=60)
                                           │
                                    top candidates
                                           │
                                           ▼
                                    ┌─────────────┐
                                    │  reranker   │
                                    │cross-encoder│
                                    └─────────────┘
                                    ms-marco-MiniLM
                                    joint attention
                                           │
                                    reranked chunks
                                           │
                                           ▼
                                    ┌────────────┐
                                    │ generator  │
                                    │ raw HTTP   │
                                    └────────────┘
                                    /chat/completions
                                    source attribution
                                           │
                                           ▼
                                    Answer + Sources
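
The fusion step in the query phase can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion with k=60 as shown in the diagram, not the code in src/retriever.py. The function name, the chunk ids, and the tie-breaking rule are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each id receives sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. (Hypothetical helper, not the repo's API.)
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_results = ["c3", "c1", "c7"]   # hypothetical ids from the vector search
sparse_results = ["c1", "c9", "c3"]  # hypothetical ids from BM25
fused = reciprocal_rank_fusion([dense_results, sparse_results])
fused_ids = [chunk_id for chunk_id, _ in fused]
print(fused_ids)
```

Chunks that appear in both lists accumulate two terms, so they tend to outrank chunks found by only one retriever. Because only ranks are used, the dense and sparse scores never need to be put on a common scale.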

Components

| File | Role | Key Technology |
|---|---|---|
| src/ingestion.py | Load PDF and text files | PyPDF2 |
| src/chunker.py | Split text into overlapping chunks | Recursive separator algorithm |
| src/embeddings.py | Batch sentence embeddings | HF Transformers + mean pooling |
| src/vectorstore.py | Exact cosine similarity index | NumPy dot product |
| src/bm25.py | Sparse lexical retrieval | Okapi BM25 from scratch |
| src/retriever.py | Hybrid dense+sparse fusion | Reciprocal Rank Fusion |
| src/reranker.py | Cross-encoder re-ranking | MS-MARCO MiniLM |
| src/generator.py | Grounded answer generation | Raw requests HTTP call |
| src/evaluation.py | Pipeline quality metrics | Precision, Recall, MRR |
| app.py | Interactive demo UI | Streamlit |
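
To illustrate the vector store's role, here is a rough sketch of exact cosine search as a NumPy dot product over L2-normalized rows. The function names and the toy 2-D vectors are assumptions for the example; the actual interface in src/vectorstore.py may differ.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors


def cosine_search(index, query, top_k=3):
    """Return (row, score) pairs for the top_k rows most similar to query.

    Both `index` (n x d) and `query` (d,) are assumed to be L2-normalized.
    """
    scores = index @ query                # one dot product per stored chunk
    order = np.argsort(-scores)[:top_k]   # indices of the highest scores first
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D "embeddings"; real sentence embeddings have many more dimensions.
index = l2_normalize(np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]))
query = l2_normalize(np.array([[2.0, 0.1]]))[0]
hits = cosine_search(index, query, top_k=2)
print(hits)
```

Normalizing once at indexing time means each query costs a single matrix-vector product, which is exact and fast enough for small corpora.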

Setup

git clone https://github.com/Archit-Konde/RAG.git
cd RAG

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

pip install -r requirements.txt

cp .env.example .env
# Edit .env and add your API key

streamlit run app.py

HuggingFace Spaces

  1. Fork this repo
  2. Create a new Space (SDK: Streamlit)
  3. Push or link the repo; Spaces reads the YAML frontmatter at the top of README.md
  4. Enter your API key in the sidebar (no .env needed on Spaces)

Running Tests

# All tests (note: embeddings/reranker tests download models on first run)
pytest tests/ -v

# Fast tests only (no model downloads)
pytest tests/ -v --ignore=tests/test_embeddings.py --ignore=tests/test_reranker.py

# With coverage report
pytest tests/ --cov=src --cov-report=term-missing

Benchmarks

Evaluated on a 25-question QA set over a 30-section HTTP/1.1 protocol corpus (~12,000 characters). Ground truth was verified by inspecting chunk boundaries before authoring test cases. Reproduce with: python scripts/run_benchmark.py

| Metric | Dense only | Sparse only | Hybrid (RRF) | Hybrid + Rerank |
|---|---|---|---|---|
| Precision@5 | 0.2240 | 0.2160 | 0.2240 | 0.2240 |
| Recall@5 | 1.0000 | 0.9600 | 1.0000 | 1.0000 |
| MRR | 0.9733 | 0.8933 | 0.9800 | 1.0000 |

Hybrid + Rerank achieves MRR = 1.0, meaning the cross-encoder placed a relevant chunk at rank 1 for every query. Precision@5 is low by design: top-5 retrieval returns five chunks per query, and each question has roughly one relevant chunk, so about four of the five results are necessarily non-relevant and Precision@5 cannot rise much above 0.2.
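
The three metrics can be computed along these lines. This is a small sketch using the standard definitions; the function names and the two toy queries are assumptions and are not taken from src/evaluation.py or the benchmark set.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k."""
    return sum(1 for cid in retrieved[:k] if cid in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


# Two hypothetical queries, each with a single relevant chunk.
runs = [
    (["c4", "c1", "c8", "c2", "c5"], {"c4"}),  # relevant chunk at rank 1
    (["c7", "c3", "c9", "c0", "c6"], {"c3"}),  # relevant chunk at rank 2
]
p5 = sum(precision_at_k(r, rel, 5) for r, rel in runs) / len(runs)
r5 = sum(recall_at_k(r, rel, 5) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
print(p5, r5, mrr)
```

With one relevant chunk per query, Precision@5 tops out at 0.2 even when Recall@5 is perfect, which is why MRR is the more informative number for comparing the four configurations.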


Key Implementation Notes

No framework abstractions — every algorithm is implemented directly:

  • chunker.py: Recursive separator-based splitting with a deque-window overlap
  • bm25.py: Okapi BM25 with Robertson-Walker IDF from the formula up
  • vectorstore.py: Cosine similarity = dot product after L2 normalization
  • retriever.py: RRF with score = Σ 1/(k + rank) across dense + sparse lists
  • embeddings.py: HF AutoModel + manual mean pooling (not sentence-transformers)
  • reranker.py: Cross-encoder raw logit scoring (not softmax — ranking only needs order)
  • generator.py: requests.post to /chat/completions — works with any OpenAI-compatible API
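
As a rough illustration of the sparse side, the sketch below scores pre-tokenized documents with a common Okapi BM25 formulation. The method names fit and get_top_n echo the architecture diagram, but their signatures, the defaults k1=1.5 and b=0.75, the smoothed IDF variant, and the toy corpus are all assumptions; src/bm25.py may differ in each of these.

```python
import math
from collections import Counter


class BM25:
    """Minimal Okapi BM25 over pre-tokenized documents (illustrative only)."""

    def __init__(self, k1=1.5, b=0.75):  # assumed defaults, not the repo's
        self.k1 = k1
        self.b = b

    def fit(self, corpus):
        self.doc_freqs = [Counter(doc) for doc in corpus]
        self.doc_lens = [len(doc) for doc in corpus]
        self.avgdl = sum(self.doc_lens) / len(corpus)
        n = len(corpus)
        df = Counter(term for doc in corpus for term in set(doc))
        # Smoothed IDF; the +1 inside the log keeps every weight positive.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1.0) for t, f in df.items()}
        return self

    def score(self, query, doc_index):
        freqs = self.doc_freqs[doc_index]
        # Length normalization: longer-than-average documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * self.doc_lens[doc_index] / self.avgdl)
        total = 0.0
        for term in query:
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def get_top_n(self, query, n=5):
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_freqs))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:n]


corpus = [
    "http get requests retrieve a resource".split(),
    "post requests submit data to the server".split(),
    "status codes describe the response".split(),
]
bm25 = BM25().fit(corpus)
top = bm25.get_top_n("get requests".split(), n=2)
print(top)
```

The first document matches both query terms and ranks first; the second matches only "requests", which is also the less distinctive term because it occurs in two of the three documents.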

Learning document: docs/LEARNING.md contains full math derivations for each algorithm, suitable for a blog post.


License

MIT
