# RAG Assignment — Jupyter Notebook
**Author:** Jafar Alsaleh 
**Purpose:** Build a Retrieval-Augmented Generation (RAG) pipeline that uses PDF-only knowledge sources, FAISS vector DB, sentence-transformer embeddings, and a generation model. The notebook contains: extraction, chunking, embedding, FAISS indexing, retrieval, and generation. It includes at least 3 test queries with outputs.

**Files required (place in notebook directory):**
- `alice_in_wonderland.pdf`
- `pride_and_prejudice.pdf`
- `sherlock_adventures.pdf`

Run each cell in order.


## 0) Environment / Install dependencies

This cell installs all required packages. If `pip` cannot find a `faiss-cpu` wheel for your platform (this can happen on some Windows setups), installing FAISS through conda is a common workaround.


In [1]:
# Install dependencies (run once)
!pip install -q pdfplumber sentence-transformers faiss-cpu transformers torch tqdm
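
After the install cell has run, a quick check can confirm that every package is importable before continuing. The sketch below is not part of the original notebook; `missing_modules` and `REQUIRED_MODULES` are hypothetical names, and the list uses import names (for example `faiss`, `sentence_transformers`) rather than pip names.

```python
import importlib.util

# Import names (not pip names) of the packages installed above.
REQUIRED_MODULES = ["pdfplumber", "sentence_transformers", "faiss",
                    "transformers", "torch", "tqdm"]

def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

An empty result means the imports in section 1 should succeed; anything listed needs to be installed first.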

## 1) Imports and helper functions


In [2]:
import os
import re
from typing import List, Tuple, Dict
import pdfplumber
from tqdm import tqdm
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss
import pickle
import json
import textwrap

# Generation fallback: transformers pipeline (FLAN-T5 small) if OPENAI not used.
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

# Helper: simple text cleanup
def clean_text(s: str) -> str:
    s = s.replace('\r', '\n')
    s = re.sub(r'\n\s*\n+', '\n\n', s)   # collapse multiple blank lines
    s = s.strip()
    return s

# Helper: save/load index + metadata
def save_faiss_index(index, embeddings_meta, index_path="faiss_index.bin", meta_path="index_meta.pkl"):
    faiss.write_index(index, index_path)
    with open(meta_path, "wb") as f:
        pickle.dump(embeddings_meta, f)

def load_faiss_index(index_path="faiss_index.bin", meta_path="index_meta.pkl"):
    index = faiss.read_index(index_path)
    with open(meta_path, "rb") as f:
        meta = pickle.load(f)
    return index, meta


## 2) Problem statement

**Purpose of the system:**  
Create a Retrieval-Augmented Generation system that answers natural-language queries in two steps. It first retrieves relevant text chunks from a set of text-only PDF documents (public or self-created), using dense vector embeddings and FAISS for fast similarity search. It then conditions a text generator on the retrieved passages to produce accurate, grounded answers.


## 3) Dataset / Knowledge Source

- **File type:** PDF only  
- **Content type:** Text only (no scanned images; PDFs should contain selectable text)  
- **Source:** Public domain PDFs

## 4) RAG Architecture — block diagram

Below is a textual/block diagram showing the pipeline.

PDF Collection (3 PDFs) ---> Text Extraction (pdfplumber) ---> Chunking (overlapping) ---> Embedding (all-MiniLM-L6-v2) ---> FAISS Index ---> Retrieval (top-k chunks) ---> Context Assembly (top-k chunks + prompt) ---> Generator (OpenAI or local FLAN-T5) ---> Final Answer
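
To make the data flow concrete, here is a toy, dependency-free sketch of the retrieve-then-prompt stages. It is not the notebook's implementation: word-count vectors stand in for the sentence-transformer embeddings, a sorted list stands in for FAISS, and the names `toy_embed`, `toy_retrieve`, and `toy_prompt` are hypothetical.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Toy embedding: lowercase word counts (stand-in for a dense model)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def toy_cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def toy_retrieve(query, chunks, top_k=2):
    """Rank chunks by similarity to the query and keep the best top_k."""
    q = toy_embed(query)
    scored = sorted(((toy_cosine(q, toy_embed(c)), c) for c in chunks), reverse=True)
    return [c for _, c in scored[:top_k]]

def toy_prompt(query, retrieved):
    """Assemble the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion:\n{query}\n\nFinal Answer:"

chunks = [
    "Alice fell down the rabbit hole.",
    "Mr. Darcy visited Netherfield.",
    "Holmes examined the photograph.",
]
question = "Who fell down the rabbit hole?"
top = toy_retrieve(question, chunks, top_k=1)
prompt = toy_prompt(question, top)
print(prompt)
```

The prompt string is what a generator would receive; in the notebook that role is played by OpenAI or FLAN-T5.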

## 5) Text chunking strategy

We use **character-based chunking** with overlap:

- **Chunk size:** 1000 characters (~150–220 words typical)  
- **Chunk overlap:** 200 characters

**Reasoning / justification:**
- Keeps chunks large enough to retain short narrative sections or paragraphs (context) so the generator sees coherent text.
- Overlap of 200 chars reduces boundary-cut issues (important so sentences/concepts that cross chunk borders are not lost).
- Character-based chunking does not depend on any model's tokenizer, so it behaves the same whichever embedding or generation model is used; it is simple to implement and deterministic.
- For production, you may adapt to token-based chunking (e.g., 512–1024 tokens) if using a specific LLM tokenizer.
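
With these settings each window starts 800 characters after the previous one (1000 minus 200). A minimal sketch of the resulting character offsets follows; `chunk_spans` is a hypothetical helper used only for illustration, and it assumes the same fixed-stride sliding window as the chunking cell in section 7.

```python
def chunk_spans(n_chars, chunk_size=1000, chunk_overlap=200):
    """Return (start, end) character offsets of each chunk."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap  # 800 with the settings above
    return [(s, min(s + chunk_size, n_chars)) for s in range(0, n_chars, stride)]

spans = chunk_spans(2500)
print(spans)  # [(0, 1000), (800, 1800), (1600, 2500), (2400, 2500)]
```

Each span shares its first 200 characters with the end of the previous one. The last span here is a short tail already covered by the span before it, which is a side effect of the fixed stride.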


## 6) Embedding model selection

**Model used:** `all-MiniLM-L6-v2` from SentenceTransformers.

**Why chosen:**
- Small, fast, and accurate for semantic search over varied text.
- Produces 384-dimensional dense vectors — compact for FAISS.
- No API keys required (runs locally with CPU/GPU).
- Widely used in RAG/semantic search tutorials; good tradeoff between performance and compute needs.
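
To put "compact" in numbers, the sketch below estimates the raw vector storage of a flat FAISS index, which keeps every vector uncompressed as float32. It uses the 1547 chunks this notebook produces in section 7; `flat_index_bytes` is a hypothetical helper, and small fixed overheads of the index file are ignored.

```python
def flat_index_bytes(n_vectors, dim=384, bytes_per_float=4):
    """Raw storage for n_vectors float32 vectors of the given dimension."""
    return n_vectors * dim * bytes_per_float

size = flat_index_bytes(1547)
print(size, "bytes =", round(size / 2**20, 2), "MiB")  # 2376192 bytes = 2.27 MiB
```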


## 7) Implementation: Data-loading → Chunking → Embedding → FAISS → Retrieval → Generation

The full code follows, one cell per step. Read the comments and run the cells in order.

In [3]:
# 7.1 Load PDFs and extract text
pdf_files = ["alice_in_wonderland.pdf", "pride_and_prejudice.pdf", "sherlock_adventures.pdf"]

def extract_text_from_pdf(path: str) -> str:
    text_pages = []
    with pdfplumber.open(path) as pdf:
        for p in pdf.pages:
            txt = p.extract_text()
            if txt:
                text_pages.append(txt)

    full_text = clean_text("\n\n".join(text_pages))

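    # Remove Project Gutenberg boilerplate.
    # Note: only the "*** START OF" / "*** END OF" marker style is handled; a footer
    # that begins differently (e.g. "End of Project Gutenberg's ...") stays in the text.
    start_marker = "*** START OF"
    end_marker = "*** END OF"

    start_idx = full_text.find(start_marker)
    if start_idx != -1:
        # Crude header skip: drop everything up to 200 characters past the start of the marker.
        full_text = full_text[start_idx + 200:]

    end_idx = full_text.find(end_marker)
    if end_idx != -1:
        full_text = full_text[:end_idx]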

    return full_text.strip()


corpus_texts = {}
for f in pdf_files:
    if not os.path.exists(f):
        raise FileNotFoundError(f"Required file not found: {f}. Please upload it.")
    corpus_texts[f] = extract_text_from_pdf(f)
    print(f"Loaded {f}: {len(corpus_texts[f])} characters")


Loaded alice_in_wonderland.pdf: 142706 characters
Loaded pride_and_prejudice.pdf: 505894 characters
Loaded sherlock_adventures.pdf: 587310 characters


In [4]:
# 7.2 Chunking function (character-wise)
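def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    # Defaults match the strategy in section 5 (1000 characters, 200 overlap).
    if not 0 <= chunk_overlap < chunk_size:
        # Otherwise the window would never advance and the loop would not terminate.
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    chunks = []
    start = 0
    L = len(text)
    while start < L:
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(clean_text(chunk))
        start = end - chunk_overlap  # step forward by chunk_size - chunk_overlap
    return [c for c in chunks if len(c) > 50]  # drop tiny chunks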

# Create chunk list with metadata
all_chunks = []
for fname, txt in corpus_texts.items():
    cks = chunk_text(txt, chunk_size=1000, chunk_overlap=200)
    for i, c in enumerate(cks):
        meta = {"source": fname, "chunk_id": f"{os.path.basename(fname)}_chunk_{i}", "text_preview": c[:200]}
        all_chunks.append((c, meta))
print(f"Total chunks created: {len(all_chunks)}")


Total chunks created: 1547
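
The total can be cross-checked from the character counts printed in step 7.1. The sketch below assumes a stride of 800 characters (1000 minus 200) and that no chunk falls under the 50-character filter; the variable names are illustrative.

```python
import math

# Character counts reported by the loading cell (7.1).
char_counts = {
    "alice_in_wonderland.pdf": 142706,
    "pride_and_prejudice.pdf": 505894,
    "sherlock_adventures.pdf": 587310,
}

stride = 1000 - 200  # chunk_size - chunk_overlap
per_file = {name: math.ceil(n / stride) for name, n in char_counts.items()}
total = sum(per_file.values())

print(per_file)
print("Total:", total)  # Total: 1547
```

The estimate is 179 + 633 + 735 chunks, which agrees with the reported total.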


In [5]:
# 7.3 Embeddings — model load and encode all chunks
embed_model_name = "all-MiniLM-L6-v2"
embed_model = SentenceTransformer(embed_model_name)

# Batch encode
texts = [c for c, _ in all_chunks]
batch_size = 64
embeddings = embed_model.encode(texts, show_progress_bar=True, batch_size=batch_size, convert_to_numpy=True)
print("Embeddings shape:", embeddings.shape)



Embeddings shape: (1547, 384)


In [6]:
# 7.4 Build FAISS index (inner-product / cosine similarity via normalization)
d = embeddings.shape[1]
# We'll use IndexFlatIP with normalized vectors for cosine similarity
index = faiss.IndexFlatIP(d)
# normalize embeddings for cosine similarity
def normalize(v: np.ndarray):
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    norms[norms==0] = 1e-10
    return v / norms

embeddings_norm = normalize(embeddings.astype('float32'))
index.add(embeddings_norm)
print("FAISS index size (n):", index.ntotal)

# Save metadata mapping: index->meta
index_meta = [meta for _, meta in all_chunks]
# Save index and meta
save_faiss_index(index, index_meta, index_path="faiss_index.bin", meta_path="index_meta.pkl")
print("Index and metadata saved.")


FAISS index size (n): 1547
Index and metadata saved.
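
The index relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity. The sketch below checks this on two small vectors in plain Python, so it runs without NumPy or FAISS; `l2_normalize`, `dot`, and `cosine` are hypothetical helpers that mirror what `normalize` and `IndexFlatIP` do together.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length (guarding against a zero vector)."""
    norm = math.sqrt(dot(v, v)) or 1e-10
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]
ip = dot(l2_normalize(a), l2_normalize(b))
print(round(ip, 6))  # 0.96
```

Because of this equivalence, the scores returned by `index.search` can be read as cosine similarities, with 1.0 the maximum.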


In [7]:
# 7.5 Retrieval function: return top_k chunks for a query
def embed_query(query: str):
    q_emb = embed_model.encode([query], convert_to_numpy=True)
    return normalize(q_emb.astype('float32'))

def retrieve(query: str, top_k: int = 5):
    qv = embed_query(query)
    D, I = index.search(qv, top_k)  # D=similarity scores, I=indices
    results = []
    for score, idx in zip(D[0], I[0]):
        if idx < 0:
            continue  # FAISS pads with -1 when fewer than top_k vectors are available
        # Look the text up directly so no local name shadows the chunk_text() function from 7.2
        results.append({"score": float(score), "meta": index_meta[idx], "text": texts[idx]})
    return results

# Quick local test (no generation) to see retrieval
sample_q = "Who is Alice and how does she enter Wonderland?"
retrieved = retrieve(sample_q, top_k=3)
for r in retrieved:
    print("score:", r['score'], "source:", r['meta']['source'])
    print(r['text'][:400].replace("\n"," "))
    print("----")


score: 0.601995587348938 source: alice_in_wonderland.pdf
d neither of the others took the least notice of her going, though she looked back once or twice, half hoping that they would call after her: the last time she saw them, they were trying to put the Dormouse into the teapot. 'At any rate I'll never go THERE again!' said Alice as she picked her way through the wood. 'It's the stupidest tea-party I ever was at in all my life!' Just as she said this, 
----
score: 0.5658360719680786 source: alice_in_wonderland.pdf
eeble voice: 'I heard every word you fellows were saying.' 'Tell us a story!' said the March Hare. 'Yes, please do!' pleaded Alice. 'And be quick about it,' added the Hatter, 'or you'll be asleep again before it's done.' 'Once upon a time there were three little sisters,' the Dormouse began in a great hurry; 'and their names were Elsie, Lacie, and Tillie; and they lived at the bottom of a well—' '
----
score: 0.562087893486023 source: alice_in_wonderland.pdf
tale, perhaps ev

## 8) Generation layer (two options)

- **Option A (recommended if you have an OpenAI API key):** Use OpenAI's chat completions API (gpt-3.5 / gpt-4) to condition on retrieved chunks. (Put your `OPENAI_API_KEY` in the environment or use your preferred method.)
- **Option B (local, API-free):** Use the HuggingFace **FLAN-T5 small** generator (`google/flan-t5-small`) via `transformers`. This is slower but works offline once the model has been downloaded. Both options are included; the notebook calls OpenAI only when `OPENAI_API_KEY` is set and `use_openai=True` is passed, otherwise it uses FLAN-T5 small.

**Note:** If you use FLAN-T5 small, answer quality will be lower than with a modern OpenAI LLM.
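
As implemented in `generate_answer` in 8.3, the OpenAI path is taken only when the caller asks for it and a key is present; every other case uses the local model. The selection rule can be sketched in isolation as follows, where `choose_backend` is a hypothetical helper and `env` is any mapping such as `os.environ`.

```python
def choose_backend(use_openai, env):
    """Return which generator would be used for the given flag and environment."""
    if use_openai and env.get("OPENAI_API_KEY"):
        return "openai"
    return "flan-t5-small"

print(choose_backend(True, {"OPENAI_API_KEY": "sk-example"}))   # openai
print(choose_backend(False, {"OPENAI_API_KEY": "sk-example"}))  # flan-t5-small
print(choose_backend(True, {}))                                 # flan-t5-small
```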

In [8]:
# 8.1 Prepare a simple context-assembly & prompt template
def assemble_context(retrieved_chunks: List[dict], max_chars: int = 1200):
    ctx_parts = []
    total = 0

    for r in retrieved_chunks:
        t = r['text']

        if total + len(t) > max_chars:
            remaining = max_chars - total
            if remaining > 100:
                ctx_parts.append(t[:remaining])
            break

        ctx_parts.append(t)
        total += len(t)

    return "\n\n".join(ctx_parts)


def make_prompt(query: str, retrieved_chunks: List[dict]):
    context = assemble_context(retrieved_chunks)

    prompt = f"""
You are an academic assistant.

Using ONLY the provided context, write a short summarized answer in 3-5 sentences.
Do NOT copy large parts of the text.
Explain clearly in your own words.

Context:
{context}

Question:
{query}

Final Answer:
"""
    return prompt
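
One consequence of `max_chars=1200` is worth spelling out: with 1000-character chunks, only the first chunk and about 200 characters of the second fit into the context, whatever `top_k` is. The sketch below replays the budget logic of `assemble_context` on chunk lengths alone; `context_budget` is a hypothetical helper, not part of the notebook.

```python
def context_budget(chunk_lengths, max_chars=1200, min_tail=100):
    """Return how many characters of each chunk end up in the context."""
    used, total = [], 0
    for n in chunk_lengths:
        if total + n > max_chars:
            remaining = max_chars - total
            if remaining > min_tail:
                used.append(remaining)  # truncated final chunk
            break
        used.append(n)
        total += n
    return used

print(context_budget([1000, 1000, 1000]))  # [1000, 200]
```

Raising `max_chars` would let more retrieved chunks through, but the prompt still has to fit the generator's input limit.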


In [9]:
# 8.2 Option B (local) — load FLAN-T5 pipeline (fallback)
# Used whenever generate_answer() is not asked to call OpenAI, or OPENAI_API_KEY is not set.
try:
    generator = pipeline("text2text-generation", model="google/flan-t5-small", device=-1, max_length=384)
    print("Local generator (flan-t5-small) loaded.")
except Exception as e:
    print("Local generator failed to load. You can use OpenAI option by setting OPENAI_API_KEY. Error:", e)
    generator = None


Device set to use cpu


Local generator (flan-t5-small) loaded.


In [10]:
# 8.3 Generation function wrapper: tries OpenAI first (if requested), else uses local generator
import os

def generate_answer(prompt: str, use_openai: bool = False, openai_model: str = "gpt-3.5-turbo"):
    # Call OpenAI only when use_openai is True AND OPENAI_API_KEY is set.
    if use_openai and os.environ.get("OPENAI_API_KEY"):
        try:
            # Needs `pip install openai` (not installed in section 0).
            # Uses the openai>=1.0 client; the older openai.ChatCompletion interface was removed there.
            from openai import OpenAI
            client = OpenAI()  # reads OPENAI_API_KEY from the environment
            response = client.chat.completions.create(
                model=openai_model,
                messages=[{"role":"system","content":"You are a helpful assistant."},
                          {"role":"user","content":prompt}],
                max_tokens=400,
                temperature=0.0,
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            print("OpenAI generate failed:", e)
            # fall back to local generator
    # fallback: local generator
    if generator:
        out = generator(prompt, max_length=384, do_sample=False)
        return out[0]['generated_text'].strip()
    else:
        raise RuntimeError("No generator available. Set OPENAI_API_KEY or install the local model.")


## 9) Retrieval + Generation function (end-to-end)


In [11]:
def answer_query(query: str, top_k: int = 5, use_openai_if_available: bool = False):
    retrieved = retrieve(query, top_k=top_k)
    prompt = make_prompt(query, retrieved)
    answer = generate_answer(prompt, use_openai=use_openai_if_available)
    # Unique sources in retrieval-rank order (dict.fromkeys de-duplicates and keeps order)
    sources = list(dict.fromkeys(r['meta']['source'] for r in retrieved))
    return {
        "query": query,
        "answer": answer,
        "sources": sources,
        "retrieved": retrieved
    }


## 10) Test queries with outputs (minimum 3)

We show the queries, the execution code, and example outputs.

In [12]:
# Test queries (you can change them)
test_queries = [
    # Query for Alice
    "Who is Alice and how does she enter Wonderland? Summarize briefly and reference the source.",
    # Query for Pride & Prejudice
    "What is Elizabeth Bennet's view about marriage and how does it differ from other characters?",
    # Query for Sherlock
    "In 'A Scandal in Bohemia', who is the principal character, and what is the main plot/issue?"
]

results = []
for q in test_queries:
    print("="*80)
    print("QUERY:", q)
    res = answer_query(q, top_k=2, use_openai_if_available=False)  # set True if you have OPENAI_API_KEY and want to use it
    print("\n--- GENERATED ANSWER ---\n")
    print(res['answer'])
    print("\n--- SOURCES (retrieved) ---")
    print(res['sources'])
    print("\n--- TOP retrieved chunk previews ---")
    for i, r in enumerate(res['retrieved']):
        print(f"[{i}] score={r['score']:.4f} source={r['meta']['source']} preview={r['meta']['text_preview'][:200].replace(chr(10),' ')}")
    print("\n")
    results.append(res)


QUERY: Who is Alice and how does she enter Wonderland? Summarize briefly and reference the source.

--- GENERATED ANSWER ---

The END End of Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll d neither of the others took the least notice of her going, though she looked back once or twice, half hoping that they would call after her: the last time she saw them, they were trying to put the Dormouse into the teapot. 'At any rate I'll never go THERE again!' said Alice as she picked her way through the wood. 'It's the stupidest tea-party I ever was at in all my life!'

--- SOURCES (retrieved) ---
['alice_in_wonderland.pdf']

--- TOP retrieved chunk previews ---
[0] score=0.5571 source=alice_in_wonderland.pdf preview=tale, perhaps even with the dream of Wonderland of long ago: and how she would feel with all their simple sorrows, and find a pleasure in all their simple joys, remembering her own child-life, and the
[1] score=0.5427 source=alice_in_wonderland.pdf preview=d n