A minimal, end‑to‑end Retrieval‑Augmented Generation (RAG) pipeline for PDFs. It extracts and cleans text from PDFs, chunks it with token‑aware packing and overlap, embeds chunks with either a local SentenceTransformer or OpenAI embeddings, indexes them in FAISS, and answers questions by retrieving top‑k chunks and generating a cited answer with OpenAI Chat Completions.
- PDF extraction: Uses PyMuPDF to extract page text with optional header/footer cleaning.
- Chunking: Token‑aware packing via `tiktoken` with configurable target size and overlap.
- Embeddings: Local `all-MiniLM-L6-v2` by default, or OpenAI `text-embedding-3-small` when enabled.
- Vector index: FAISS L2 index saved to disk with aligned `embeddings.npy` and `ids.npy`.
- Retrieval: Top‑k nearest‑neighbor search for a query.
- Generation: OpenAI Chat Completion with a strict "use provided context" system message and bracketed citations.
- Simple pipelines: `pipelines/ingest.py` and `pipelines/query.py`, runnable as scripts.
```
miniRAG/
  data/
    raw/                 # place your PDFs here
    processed/           # chunks.jsonl + meta.json
    index/               # faiss.index + embeddings.npy + ids.npy
  pipelines/
    ingest.py            # build chunks + embeddings + FAISS index
    query.py             # retrieve and generate an answer
  src/
    config.py            # configuration, paths, model choices
    preprocess/
      extract_pdf.py     # PDF -> pages
      cleaning.py        # page cleaning helpers
      chunking.py        # sentence split + token-aware packing
      embed.py           # local or OpenAI embeddings
      index_store.py     # save/load FAISS + npy sidecars
    rag/
      retrieval.py       # top-k search
      generator.py       # LLM answer with citations
    models/
      chunk.py           # Chunk dataclass
  rag_app.py             # small in-memory demo (toy example)
  requirements.txt
  ReadME.md
```
- Python 3.10+
- macOS/Linux/Windows
Install dependencies:
```
python -m venv .venv && source .venv/bin/activate   # macOS/Linux
# or: .venv\Scripts\activate on Windows
pip install -r requirements.txt
```

Set environment variables in a `.env` file at the project root:

```
OPENAI_API_KEY=sk-...        # required for generation; also for OpenAI embeddings if enabled
USE_OPENAI_EMBEDDINGS=false  # set to true to use OpenAI embeddings; default uses SentenceTransformers
```
Models and defaults are defined in `src/config.py`:

- `MODEL_CHAT`: `gpt-4o-mini`
- `MODEL_EMBEDDING`: `text-embedding-3-small` (used only if `USE_OPENAI_EMBEDDINGS=true`)
- `TARGET_TOKENS`: 500
- `OVERLAP_TOKENS`: 64
- Paths for `data/processed` and `data/index` outputs
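The exact contents of `src/config.py` are not shown here; a sketch consistent with the defaults above might look like the following (the constant names match the list above, but the file layout is an assumption):

```python
from pathlib import Path

# Model choices (chat is always OpenAI; embeddings are switchable via env var)
MODEL_CHAT = "gpt-4o-mini"
MODEL_EMBEDDING = "text-embedding-3-small"  # used only when USE_OPENAI_EMBEDDINGS=true

# Chunking parameters
TARGET_TOKENS = 500   # approximate tokens per chunk
OVERLAP_TOKENS = 64   # tokens carried between consecutive chunks

# Output locations
PROCESSED_DIR = Path("data/processed")
INDEX_DIR = Path("data/index")
```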
`pipelines/ingest.py` will:
- Extract and clean pages from the PDF
- Chunk text into token‑bounded chunks with overlap
- Embed chunks (local or OpenAI)
- Build a FAISS index and save artifacts
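The chunking step above can be sketched as a greedy pack-with-overlap loop. This is a simplified stand-in, not the project's actual `chunking.pack_chunks`: it uses whitespace splitting in place of `tiktoken` so it runs without dependencies, but the pack-then-carry-overlap idea is the same:

```python
def pack_chunks(sentences, target_tokens=500, overlap_tokens=64):
    """Greedily pack sentences into ~target_tokens chunks, carrying the
    last overlap_tokens tokens of each chunk into the next one."""
    chunks, current = [], []
    for sent in sentences:
        toks = sent.split()  # stand-in for tiktoken encoding
        if current and len(current) + len(toks) > target_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap_tokens:]  # overlap carried forward
        current.extend(toks)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note the overlap means consecutive chunks share their boundary tokens, which helps retrieval when an answer spans a chunk break.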
Run (the example uses the included sample `data/raw/oldmansea.pdf`):

```
python pipelines/ingest.py
```

By default, the script's `__main__` runs with `pdf_path="data/raw/oldmansea.pdf"` and `doc_id="oldmansea"`.
To call programmatically:
```python
from pipelines.ingest import main

main(pdf_path="data/raw/oldmansea.pdf", doc_id="oldmansea")
```

Artifacts produced:

- `data/processed/chunks.jsonl`: one JSON object per chunk with ids and text
- `data/processed/meta.json`: basic dataset metadata
- `data/index/faiss.index`: FAISS L2 index
- `data/index/embeddings.npy`: embedding vectors
- `data/index/ids.npy`: parallel array of chunk ids
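A quick way to inspect the chunk artifact is to read `chunks.jsonl` line by line; the field names in the example below are illustrative, not the project's exact schema:

```python
import json

def load_chunks(path="data/processed/chunks.jsonl"):
    """Read one JSON object per line (JSON Lines format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```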
`pipelines/query.py` loads the chunks and the FAISS index, retrieves the top‑k matches, and asks the chat model to answer strictly from the context with citations.
Run an example question:
```
python pipelines/query.py
```

The script's `__main__` contains a few example questions; edit them or call it programmatically:

```python
from pipelines.query import main

main("What kind of fish does Santiago catch?", k=4)
```

The output is a single answer string that includes citations like `[doc_id:chunk_number]`.
`rag_app.py` shows a minimal, self‑contained RAG loop over an in‑memory list of strings using SentenceTransformers and OpenAI chat. It is independent of the disk‑backed pipelines and useful for quick sanity checks.
Run:
```
python rag_app.py
```

How it works:

- PDF → pages: `extract_pdf.extract_pages` reads pages and applies `cleaning.clean_page`.
- Pages → chunks: `chunking.pack_chunks` sentence‑splits and packs to ~`TARGET_TOKENS` with `OVERLAP_TOKENS` of overlap.
- Chunks → embeddings: `embed.embed_texts` uses either local SentenceTransformers or OpenAI embeddings.
- Embeddings → FAISS: `index_store.build_and_save_index` writes `faiss.index`, `embeddings.npy`, and `ids.npy`.
- Query → top‑k: `rag.retrieval.top_k` embeds the query and searches FAISS.
- Context → answer: `rag.generator.answer` calls OpenAI Chat with a system instruction to use only the provided context and to cite chunks.
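The retrieval step is an exact nearest-neighbor search: a flat FAISS L2 index (`IndexFlatL2`) computes squared L2 distances to every stored vector and returns the k closest. This pure-Python sketch shows the equivalent logic (illustrative, not the project's `rag.retrieval` code, and far slower than FAISS at scale):

```python
def top_k(query_vec, embeddings, ids, k=4):
    """Exact L2 nearest-neighbor search over a list of vectors,
    returning the ids of the k closest embeddings."""
    def sq_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    order = sorted(range(len(embeddings)), key=lambda i: sq_l2(query_vec, embeddings[i]))
    return [ids[i] for i in order[:k]]
```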
- Ensure `.env` has a valid `OPENAI_API_KEY` before running `query.py` or `rag_app.py`.
- CPU‑only FAISS is used (`faiss-cpu`). If import errors occur, reinstall with the pinned version from `requirements.txt`.
- If you enable OpenAI embeddings, usage costs apply; local embeddings are free but require downloading a small SentenceTransformers model on first run.
- Chunk sizes are approximate (tokenized by `tiktoken`). Adjust `TARGET_TOKENS`/`OVERLAP_TOKENS` in `src/config.py` to tune recall/precision.
MIT (add your actual license if different).