# Project: PDF Query System with Local Retrieval-Augmented Generation

## Overview
This project implements an **Improved Retrieval-Augmented Generation (RAG)** system that combines **information retrieval** with **language generation**.  
It allows you to ask questions and get answers grounded in your own documents using **FAISS** for similarity search and **Transformer-based** models for text generation.

The notebook is fully self-contained and modular, making it ideal for learning, experimentation, and extension into production-ready pipelines.



## 1) Requirements

Install required libraries. Run these once in the notebook environment. If you have alternative libraries already installed, adapt accordingly.

In [19]:
#!pip install -q -U sentence-transformers faiss-cpu transformers accelerate datasets tiktoken
# Optional: if using OpenAI for generation
# !pip install -q openai
#!pip install PyPDF2 


## 2) Imports and configuration

Import libraries and set configuration variables. Use local models if you prefer. Keep keys secret; use environment variables.

In [11]:
import os
from pathlib import Path
from typing import List, Tuple, Dict, Any

# Config - replace or set as environment variables
OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')  # optional if you use OpenAI
DATA_DIR = Path('Data')  # place your documents here
EMBEDDING_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'  # compact, fast
GENERATION_MODEL = 'gpt2'  # placeholder. replace with a powerful local HF model or call OpenAI/GPT
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

print('Config set. Data dir:', DATA_DIR)


Config set. Data dir: Data


## 3) Load documents

Load text files, PDFs, or other sources. This example reads `.txt` files. Split into chunks for better retrieval.

In [12]:
from pathlib import Path
from typing import List, Tuple
from PyPDF2 import PdfReader

def load_pdf_files(folder: Path) -> List[Tuple[str, str]]:
    """Return list of (source_name, text) from PDFs."""
    docs = []
    if not folder.exists():
        print('Data folder not found:', folder)
        return docs
    for p in folder.glob('**/*.pdf'):
        try:
            reader = PdfReader(p)
            text = "\n".join(page.extract_text() or "" for page in reader.pages)
            docs.append((str(p), text))
        except Exception as e:
            print(f"Error reading {p}: {e}")
    return docs


docs = load_pdf_files(DATA_DIR)
print(f'Loaded {len(docs)} documents.')

Loaded 2 documents.


### 3.1) Text splitting

Chunk long documents to maintain relevance and fit embedding model limits.

In [13]:
def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> List[str]:
    tokens = text.split()
    chunks = []
    start = 0
    n = len(tokens)

    while start < n:
        end = min(start + size, n)
        chunk = ' '.join(tokens[start:end])
        chunks.append(chunk)
        if end == n:
            break
        start += size - overlap  # move forward
    return chunks
# Build corpus: list of dicts with metadata
corpus = []
for src, text in docs:
    z=chunk_text(text)
    for i, chunk in enumerate(chunk_text(text)):
        corpus.append({'source': src, 'chunk_id': i, 'text': chunk})
print('Corpus chunks:', len(corpus))


Corpus chunks: 12


In [14]:
from huggingface_hub import login
login()  # then paste your new HF token

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## 4) Create embeddings and FAISS index

We create dense embeddings per chunk and add them to FAISS for nearest-neighbor retrieval.

In [15]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import pickle

# Load embedding model
embedder = SentenceTransformer(EMBEDDING_MODEL)

def embed_texts(texts: List[str]) -> np.ndarray:
    # Returns numpy array of shape (n, dim)
    embs = embedder.encode(texts, show_progress_bar=True, convert_to_numpy=True)
    return embs

# Create texts list
texts = [d['text'] for d in corpus]
if len(texts) == 0:
    print('No texts to embed. Add files to', DATA_DIR)
else:
    embeddings = embed_texts(texts)
    dim = embeddings.shape[1]
    # Build FAISS index (L2)
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    print('FAISS index built. vectors:', index.ntotal, 'dim:', dim)

    # Save metadata for retrieval
    meta = corpus  # list of dicts aligned with embeddings rows

    # Persist the index and meta
    faiss.write_index(index, 'rag_index.faiss')
    with open('rag_meta.pkl', 'wb') as f:
        pickle.dump(meta, f)
    print('Index and metadata saved: rag_index.faiss, rag_meta.pkl')

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

FAISS index built. vectors: 12 dim: 384
Index and metadata saved: rag_index.faiss, rag_meta.pkl


## 5) Retriever function

Given a query, embed it and fetch top-k nearest chunks from FAISS. Return text and metadata.

In [16]:
def load_index_and_meta(index_path='rag_index.faiss', meta_path='rag_meta.pkl'):
    import pickle
    import faiss
    idx = faiss.read_index(index_path)
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    return idx, meta

index, meta = load_index_and_meta()

def retrieve(query: str, top_k: int = 4) -> List[Dict[str, Any]]:
    q_emb = embed_texts([query])
    D, I = index.search(q_emb, top_k)
    results = []
    for dist, idx in zip(D[0], I[0]):
        if idx < 0 or idx >= len(meta):
            continue
        entry = meta[idx].copy()
        entry['score'] = float(dist)
        results.append(entry)
    return results

# quick test (replace with your own query)
print('Retrieve test:', retrieve('What is the architecture of the model?', top_k=3))

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Retrieve test: [{'source': 'Data\\monopoly.pdf', 'chunk_id': 4, 'text': 'HOUSES: When you own all the propert~es in a color-group you may buy houses from the Bank and erect them on those properties. If you buy one house, you may put it on any one of those properties. The next house you buy must be erected on one of the unimproved properties of this or any other complete color- group you may own. The price you must pay the Bank for each house is shown on your ltle Deed card for the property on which you erect the house. The owner still collects double rent from an opponent who lands on the unimproved properties of hisher complete color-group. Following the above rules, you may buy and erect at any time as many houses as your judgement and financial standing will allow. But you must build evenly, i.e., you cannot erect more than one house on any one property of any color-group until you have built one house on wery property of that group. You may then begin on the second row of houses, a

## 6) Generator / RAG function

Combine retrieved passages into a prompt and call a generator model. Two options shown: local HF model or OpenAI completion. Use whichever you have access to.

In [17]:
# Option A: local HuggingFace generation (no API key required)
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# NOTE: gpt2 is small and not good for long-context RAG. Replace with a larger local model if available.
tokenizer = AutoTokenizer.from_pretrained(GENERATION_MODEL)
model = AutoModelForCausalLM.from_pretrained(GENERATION_MODEL, torch_dtype='auto', device_map='auto')
generator = pipeline('text-generation', model=model, tokenizer=tokenizer, max_new_tokens=256)

def generate_with_local(prompt: str) -> str:
    out = generator(prompt, do_sample=False, num_return_sequences=1)
    return out[0]['generated_text']

# Option B: OpenAI (if you have an API key). Uncomment and set OPENAI_API_KEY above.
# import openai
# openai.api_key = OPENAI_API_KEY
# def generate_with_openai(prompt: str, model='gpt-3.5-turbo'):
#     resp = openai.ChatCompletion.create(
#         model=model,
#         messages=[{'role':'user','content':prompt}],
#         temperature=0.0,
#         max_tokens=256
#     )
#     return resp['choices'][0]['message']['content'].strip()

Device set to use cpu


### 6.1) RAG query helper

Construct a compact prompt combining the user query and top retrieved passages. Then generate the answer.

In [18]:
from textwrap import shorten
from typing import Dict, Any

PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the given context.
Use only the information in the context to answer the question concisely. 
If the answer is not contained, say "I don't know."

Context:
{passages}

Question:
{question}

Answer:
"""

def rag_answer(query: str, top_k: int = 4, use_openai: bool = False) -> Dict[str, Any]:
    # Retrieve top-k relevant passages
    hits = retrieve(query, top_k=top_k)

    if not hits:
        return {'answer': "I don't know from the provided sources.", 'retrieved': [], 'prompt': ''}

    # Safely combine context passages
    passages_list = []
    for h in hits:
        text = h['text'].replace('\n', ' ')  # flatten newlines
        # Only shorten if really long
        if len(text) > 500:
            text = shorten(text, width=500, placeholder='...')
        passages_list.append(f"Source: {h['source']}\n{text}")

    passages = '\n---\n'.join(passages_list)

    # Build prompt
    prompt = PROMPT_TEMPLATE.format(passages=passages, question=query)

    # Generate answer
    if use_openai:
        raise NotImplementedError('OpenAI generation not enabled in this environment.')
    else:
        ans = generate_with_local(prompt)

    # Ensure the model doesn't just repeat the question
    if ans.strip().lower() == query.strip().lower():
        ans = "I don't know from the provided sources."

    return {'answer': ans, 'retrieved': hits, 'prompt': prompt}

# Example usage
res = rag_answer('Explain the principles of the game Ticket to Ride.', top_k=3)
print(res['answer'])


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


You are a helpful assistant that answers questions based on the given context.
Use only the information in the context to answer the question concisely. 
If the answer is not contained, say "I don't know."

Context:
Source: Data\ticket_to_ride.pdf
Bonus card face up next to the board π. Shuffle the Destination Ticket cards and deal 3 cards to each player ∫. Each player looks at their Destination Tickets and decides which ones they wish to keep. A playermust keep at least two, but may keep all three if he chooses. Any returned cards areplaced on the bottom of the Destination Ticket deck. This deck is then placed next to theboard ª. Players keep their Destination Tickets secret until the end of the game. You are now ready to begin....
---
Source: Data\ticket_to_ride.pdf
only one of the Double-Routes can be used. A player can claim either of the two routes between cities, but the other route is then closed to other players.∫ ∑ ∑∫ π ∏ ª To claim the route from Montréal to Toronto, a player

## Next steps checklist

- [ ] Replace GENERATION_MODEL with a strong model or enable OpenAI.
- [ ] Add PDF/Office parsing for richer ingestion.
- [ ] Add caching for embeddings.
- [ ] Add evaluation metrics and tests.

File saved as `RAG_Improved.ipynb` in the notebook root.