<a href="https://colab.research.google.com/github/NBK-code/RAG_from_Scratch/blob/main/RAG_with_Vector_DB_Reranker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Retrieval-Augmented Generation

In this notebook, we will improve upon our previous RAG implementation by adding the following features:

1. Modular coding
2. Clean Pipeline
3. FAISS Vector Database
4. Reranking Module
5. Evaluation of RAG performance

##Install Necessary Software

In [None]:
!pip install PyMuPDF
!pip install spacy
!pip install -U sentence-transformers
!pip install faiss-cpu
!pip install bitsandbytes accelerate
!pip install transformers
!pip install flash-attn

##Create Project Folder and Files

In [2]:
!mkdir -p rag_project

In [3]:
files = [
    "ingestion.py",
    "chunking.py",
    "embedder.py",
    "vector_db.py",
    "rerank.py",
    "retrieval.py",
    "prompt_builder.py",
    "llm.py",
    "rag_engine.py",
    "evaluate.py",
]

for f in files:
    open(f"rag_project/{f}", "w").close()

print("Folder structure created!")

Folder structure created!


##Create All The Modules

###Data Ingestion

In [4]:
%%writefile rag_project/ingestion.py
"""
ingestion.py

Lightweight ingestion module based on the original rag_from_scratch.py.
We keep your logic, but modularize it so the rest of the pipeline can use it cleanly.
"""

import fitz

def text_formatter(text: str) -> str:
    """Replace newlines with spaces and strip leading/trailing whitespace."""
    return text.replace("\n", " ").strip()

def open_and_read_pdf(pdf_path: str):
    """
    Open and read a PDF file.
    Returns a list of dictionaries, one per page.
    Keeps your original metadata structure.
    """
    pdf_document = fitz.open(pdf_path)
    pdf_pages_and_texts = []

    for page_number, page in enumerate(pdf_document):
        text = page.get_text()
        text = text_formatter(text)

        pdf_pages_and_texts.append({
            "page_number": page_number,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            "page_sentence_count_raw": len(text.split(". ")),
            "page_token_count_raw": len(text) / 4,   # approx. 4 chars = 1 token
            "text": text,
        })

    return pdf_pages_and_texts

Overwriting rag_project/ingestion.py


###Chunking

Splits the text of each PDF page into sentences, then groups them into chunks of a fixed number of sentences.

In [5]:
%%writefile rag_project/chunking.py
"""
chunking.py

Clean, modular version of your original chunking logic.
- Uses spaCy sentencizer
- Joins sentences
- Fixes punctuation spacing issues
- Produces chunks of N sentences
"""

import re
import spacy


def load_spacy():
    """Load spaCy sentencizer only once."""
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    return nlp


def fix_spacing(text: str) -> str:
    """
    Fix cases like:
        'How are you?I am fine' --> 'How are you? I am fine'
    Based on your original requirement.
    """
    text = re.sub(r'\.([A-Z])', r'. \1', text)
    text = re.sub(r'\?([A-Z])', r'? \1', text)
    text = re.sub(r'\!([A-Z])', r'! \1', text)
    return text


def split_into_sentences(nlp, text: str):
    """
    Split a paragraph into sentences using spaCy sentencizer.
    Returns a list of sentence strings.
    """
    text = fix_spacing(text)
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    return sentences


def chunk_sentences(sentences, chunk_size=10):
    """
    Group sentences into chunks of fixed size.
    Returns a list of strings (chunks).
    """
    chunks = []
    current_chunk = []

    for sent in sentences:
        current_chunk.append(sent)

        if len(current_chunk) >= chunk_size:
            chunks.append(" ".join(current_chunk))
            current_chunk = []

    # Add leftover sentences as final chunk
    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks


def chunk_page_text(nlp, page_dict, chunk_size=10):
    """
    Convert a page dict (from ingestion module) to a list of chunk dicts.
    Each chunk has:
        - text
        - metadata (page number, chunk index)
    """
    sentences = split_into_sentences(nlp, page_dict["text"])
    chunks = chunk_sentences(sentences, chunk_size)

    chunk_dicts = []
    for idx, chunk in enumerate(chunks):
        chunk_dicts.append({
            "page_number": page_dict["page_number"],
            "chunk_id": idx,
            "text": chunk,
        })

    return chunk_dicts

Overwriting rag_project/chunking.py
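
As a quick sanity check of the grouping logic, the sketch below condenses the spacing fix and the fixed-size grouping into plain Python and applies them to dummy data. The helper name `group_sentences`, the single combined regex, and the 25 sample sentences are illustrative assumptions and are not part of `chunking.py`.

```python
import re


def fix_spacing(text: str) -> str:
    """Insert a space after '.', '?' or '!' when a capital letter follows directly."""
    return re.sub(r"([.?!])([A-Z])", r"\1 \2", text)


def group_sentences(sentences, chunk_size=10):
    """Join consecutive sentences into chunks of at most chunk_size sentences."""
    return [
        " ".join(sentences[i:i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]


# 25 dummy sentences: expect two full chunks and one leftover chunk.
sentences = [f"Sentence {i}." for i in range(25)]
chunks = group_sentences(sentences, chunk_size=10)
chunk_lengths = [len(chunk.split(". ")) for chunk in chunks]

print(fix_spacing("How are you?I am fine"))
print(len(chunks), chunk_lengths)
```

The leftover sentences form a shorter final chunk, which matches the behaviour of `chunk_sentences` above.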


###Embed Text and Query

In [6]:
%%writefile rag_project/embedder.py
"""
embedder.py

Unified embedding module for:
- Text chunk embeddings (for FAISS)
- Query embeddings (for retrieval)

Uses: BAAI/bge-base-en-v1.5
"""

import numpy as np
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self, model_name="BAAI/bge-base-en-v1.5", device="cpu"):
        """
        Initialize the embedding model.
        device can be 'cpu' or 'cuda'
        """
        print(f"[Embedder] Loading model: {model_name} on {device}")
        self.model = SentenceTransformer(model_name, device=device)

    def embed_texts(self, texts, batch_size=16):
        """
        Embed a list of texts (chunks).
        Returns numpy array of shape (N, dim).
        """
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            show_progress_bar=True,
            convert_to_numpy=True,
            normalize_embeddings=True,
        )
        return embeddings.astype(np.float32)

    def embed_query(self, query):
        """
        Embed a single query string.
        Returns 1 vector (dim,).
        """
        vec = self.model.encode(
            query,
            normalize_embeddings=True,
            convert_to_numpy=True,
        )
        return vec.astype(np.float32)

Overwriting rag_project/embedder.py


###Create Vector Database

A vector database is a special kind of database that stores vectors (numerical embeddings) instead of text.

FAISS stands for Facebook AI Similarity Search.

It is not a full database server; it is a very fast library that:

1. Stores vectors in memory
2. Performs similarity search extremely fast
3. Supports GPU acceleration
4. Is widely used in RAG systems

Here we create the vector DB locally: the index is built in memory and then saved to Colab's file system.

Common FAISS Index Types

| **Index Type**      | **Exact?** | **Best Dataset Size** | **Speed**        | **Memory**     | **When to Use** |
|---------------------|-----------|------------------------|------------------|----------------|------------------|
| **Flat**            | ✅ Exact  | 1k–50k                 | Slowest          | High           | Exact L2 distance; rarely needed in RAG |
| **FlatIP**          | ✅ Exact  | 1k–100k                | Slow (but fast in FAISS) | High | **Best for small/medium RAG**; exact cosine similarity (normalized embeddings) |
| **HNSW32**          | ⚠️ Approx | 50k–10M                | Very Fast        | Medium         | **Most popular ANN**; great recall + speed balance |
| **IVF100**          | ⚠️ Approx | 100k–20M               | Very Fast        | Medium         | Cluster-based search; standard for large datasets |
| **IVF100,PQ16**     | ⚠️ Approx | 500k–100M              | Extremely Fast   | Very Low       | IVF + Product Quantization; memory-efficient for huge corpora |
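
To make the FlatIP row concrete, here is a minimal pure-Python sketch of what an exact inner-product search computes: every stored vector is scored against the query and the best `top_k` are kept. It also checks that, once vectors are normalized, the inner product equals cosine similarity. This only illustrates the idea and is not how FAISS is implemented; the function names and toy vectors are assumptions made for the example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vec):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def flat_ip_search(query, vectors, top_k=2):
    """Exhaustive search: score every vector, return the top_k (score, index) pairs."""
    scored = [(dot(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return scored[:top_k]


# Toy 2-D "embeddings" (made up for illustration).
raw_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [-1.0, 0.5]]
raw_query = [2.0, 1.0]

stored = [normalize(v) for v in raw_vectors]
query = normalize(raw_query)

results = flat_ip_search(query, stored, top_k=2)
for score, idx in results:
    # Inner product on normalized vectors vs. cosine on the raw vectors.
    print(idx, round(score, 4), round(cosine(raw_query, raw_vectors[idx]), 4))
```

Because every vector is scored, the cost grows linearly with the number of stored vectors, which is why the approximate index types in the table become attractive for larger corpora.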


In [7]:
%%writefile rag_project/vector_db.py
"""
vector_db.py

FAISS-based vector database for text RAG.
Uses IndexFlatIP (exact inner product search).
"""

import faiss
import numpy as np


class FAISSVectorDB:
    def __init__(self, dim):
        """
        Initialize a Flat (exact) inner-product FAISS index.
        Works perfectly for normalized BGE embeddings.
        """
        self.index = faiss.IndexFlatIP(dim)  # exact inner product (= cosine similarity for normalized vectors)
        self.metadata_store = []
        self.dim = dim

        print(f"[FAISS] Created IndexFlatIP with dim={dim}")

    def add_embeddings(self, embeddings, metadata_list):
        embeddings = embeddings.astype(np.float32)

        if len(embeddings) != len(metadata_list):
            raise ValueError("embeddings and metadata_list must have same length")

        self.index.add(embeddings)
        self.metadata_store.extend(metadata_list)

        print(f"[FAISS] Added {len(embeddings)} vectors")

    def search(self, query_vector, top_k=5):
        query_vector = query_vector.reshape(1, -1).astype(np.float32)
        scores, indices = self.index.search(query_vector, top_k)

        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx == -1:
                continue
            results.append({
                "score": float(score),
                "metadata": self.metadata_store[idx]
            })

        return results

    def save(self, index_path, metadata_path):
        faiss.write_index(self.index, index_path)
        np.save(metadata_path, self.metadata_store, allow_pickle=True)
        print(f"[FAISS] Saved index to {index_path}")
        print(f"[FAISS] Saved metadata to {metadata_path}")

    def load(self, index_path, metadata_path):
        self.index = faiss.read_index(index_path)
        self.metadata_store = np.load(metadata_path, allow_pickle=True).tolist()
        print(f"[FAISS] Loaded index from {index_path}")

Overwriting rag_project/vector_db.py


###Reranker

Embedding search gives us only coarse matches. To improve precision, we rescore the retrieved candidates with a reranker.

####Cross-Encoder Reranker

- **Architecture:** A single transformer processes *query + passage together* (`[CLS] query [SEP] passage [SEP]`) with full attention over the concatenated sequence. Query tokens attend to passage tokens and vice versa, enabling deep pairwise comparison. The model outputs a single relevance score (a probability in 0 → 1 or a raw logit in −inf → +inf).
- **Training data:** Labeled *(query, positive passage)* and *(query, negative passage)* pairs.  
- **Training objective:** Ranking-focused losses (binary cross-entropy, contrastive, softmax cross-entropy) that push positive passages to score higher than negatives.  
- **Purpose in RAG:** FAISS retrieves broadly relevant chunks; the reranker selects the *most* relevant ones, improving precision and answer quality significantly.
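
The score range and the binary cross-entropy objective mentioned above can be illustrated with a few lines of plain Python. This is a sketch with made-up logits, not the training code of any particular reranker: `sigmoid` maps a raw score to a 0 to 1 relevance value, and `bce_loss` is small when positives score high and negatives score low.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw score in (-inf, +inf) to a relevance value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def bce_loss(logit: float, label: int) -> float:
    """Binary cross-entropy for one (query, passage) pair.

    label is 1 for a positive passage and 0 for a negative one.
    """
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def pair_loss(scores) -> float:
    """Total loss over one positive and one negative passage."""
    return bce_loss(scores["positive"], 1) + bce_loss(scores["negative"], 0)


# Hypothetical raw scores from two models for the same query.
good_model = {"positive": 3.0, "negative": -2.0}   # ranks the positive higher
bad_model = {"positive": -1.0, "negative": 1.5}    # ranks the negative higher

print(round(pair_loss(good_model), 4), round(pair_loss(bad_model), 4))
```

Minimizing this loss pushes positive logits up and negative logits down, which is what makes the scores usable for sorting candidates.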


In [8]:
%%writefile rag_project/rerank.py
"""
rerank.py

Cross-encoder reranking to improve retrieval quality.
Uses BAAI/bge-reranker-base.
"""

from sentence_transformers import CrossEncoder


class Reranker:
    def __init__(self, model_name="BAAI/bge-reranker-base", device="cpu"):
        """
        Initialize the cross-encoder reranker.
        """
        print(f"[Reranker] Loading model: {model_name} on {device}")
        self.model = CrossEncoder(model_name, device=device)

    def rerank(self, query, candidate_chunks, top_k=5):
        """
        Rerank candidate chunks based on relevance to the query.

        query: string
        candidate_chunks: list of dicts, each containing
                          {score, metadata: {text, ...}},
                          usually returned by FAISSVectorDB.search

        Returns the top_k reranked chunks, each with an added 'rerank_score'.
        """

        # Prepare input pairs for cross-encoder
        pairs = []
        for item in candidate_chunks:
            text = item["metadata"]["text"]
            pairs.append([query, text])

        # Compute cross-encoder scores
        scores = self.model.predict(pairs)

        # Attach scores to chunks
        for i, item in enumerate(candidate_chunks):
            item["rerank_score"] = float(scores[i])

        # Sort by rerank score (descending)
        reranked = sorted(candidate_chunks, key=lambda x: x["rerank_score"], reverse=True)

        return reranked[:top_k]

Overwriting rag_project/rerank.py


###Retrieval

In [9]:
%%writefile rag_project/retrieval.py
"""
retrieval.py

Unified retrieval module:
- Embeds the query
- Searches FAISS vector DB
- Optionally applies Cross-Encoder reranking
"""

from embedder import Embedder
from rerank import Reranker
from vector_db import FAISSVectorDB


class Retriever:
    def __init__(self, embedder: Embedder, vector_db: FAISSVectorDB,
                 reranker: Reranker = None,
                 initial_k: int = 20,
                 final_k: int = 5):
        """
        embedder: Embedder object
        vector_db: FAISSVectorDB object
        reranker: Reranker object (optional)
        initial_k: how many candidates to pull from FAISS
        final_k: how many results to return after reranking
        """
        self.embedder = embedder
        self.vector_db = vector_db
        self.reranker = reranker
        self.initial_k = initial_k
        self.final_k = final_k

        if reranker:
            print("[Retriever] Reranking enabled.")
        else:
            print("[Retriever] Reranking disabled.")

    def retrieve(self, query):
        """
        Retrieve top chunks for the given query.
        Uses FAISS search, then optional reranking.
        """

        # 1. Embed query
        query_vec = self.embedder.embed_query(query)

        # 2. Search FAISS
        results = self.vector_db.search(query_vec, top_k=self.initial_k)

        # 3. If reranker is OFF -> return FAISS results only
        if self.reranker is None:
            return results[:self.final_k]

        # 4. If reranker is ON -> rerank the FAISS candidates
        reranked = self.reranker.rerank(query, results, top_k=self.final_k)

        return reranked

Overwriting rag_project/retrieval.py
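
The role of `initial_k` and `final_k` is easier to see on a toy example. The sketch below mimics the two-stage flow with stand-in scorers in plain Python: a cheap word-overlap score plays the part of the embedding search, and a phrase-match bonus plays the part of the cross-encoder. Both scorers, the function names, and the four-sentence corpus are assumptions made purely for illustration.

```python
def coarse_score(query, text):
    """Cheap stand-in for embedding similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def fine_score(query, text):
    """Stand-in for a cross-encoder: adds a bonus when the exact query phrase occurs."""
    bonus = 10 if query.lower() in text.lower() else 0
    return coarse_score(query, text) + bonus


def retrieve(query, corpus, initial_k=3, final_k=1):
    # Stage 1: rank the whole corpus with the cheap scorer, keep initial_k candidates.
    candidates = sorted(
        corpus, key=lambda t: coarse_score(query, t), reverse=True
    )[:initial_k]
    # Stage 2: rescore only those candidates with the expensive scorer, keep final_k.
    return sorted(
        candidates, key=lambda t: fine_score(query, t), reverse=True
    )[:final_k]


corpus = [
    "A dwarf planet can look white in photographs.",
    "A white dwarf is the remnant core of a low-mass star.",
    "Red giants are cool, luminous stars.",
    "White light contains all visible wavelengths.",
]
query = "white dwarf"

coarse_only = retrieve(query, corpus, initial_k=1, final_k=1)
reranked = retrieve(query, corpus, initial_k=3, final_k=1)
print(coarse_only[0])
print(reranked[0])
```

With `initial_k=1` the second stage has nothing to choose from, so the coarse winner is returned as is; widening the candidate pool lets the second stage promote the passage that actually contains the phrase.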


###Augmentation - Prompt Builder

In [10]:
%%writefile rag_project/prompt_builder.py
"""
prompt_builder.py

Builds a clean prompt for LLM inference using retrieved chunks.
"""


class PromptBuilder:
    def __init__(self):
        pass

    def build_prompt(self, query, retrieved_chunks):
        """
        query: string
        retrieved_chunks: list of dicts (from retriever)
                          each dict has {score, metadata: {text, ...}}

        Returns a final LLM prompt string.
        """

        # Extract chunk texts
        context_lines = []
        for i, item in enumerate(retrieved_chunks, start=1):
            text = item["metadata"]["text"]
            context_lines.append(f"[{i}] {text}")

        context_block = "\n".join(context_lines)

        # Construct final prompt
        prompt = f"""
You are a helpful assistant. Use ONLY the information provided in the context below.
If the answer is not contained in the context, respond with "I don't know."

### Question:
{query}

### Context:
{context_block}

### Answer:""".strip()

        return prompt

Overwriting rag_project/prompt_builder.py


###LLM

In [11]:
%%writefile rag_project/llm.py
"""
llm.py

Loads Gemma 2B or Gemma 7B based on GPU memory.
Supports optional 4-bit quantization.
Provides a simple .generate(prompt) interface.
"""

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available
from transformers import BitsAndBytesConfig


class LLM:
    def __init__(self):
        print("[LLM] Detecting GPU memory...")
        gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
        gpu_memory_in_gb = round(gpu_memory_bytes / (2**30))
        print(f"[LLM] Total GPU memory: {gpu_memory_in_gb} GB")

        # Choose model
        if gpu_memory_in_gb < 5.1:
            print(f"[LLM] {gpu_memory_in_gb}GB - Gemma 7B is too large.")
            print("[LLM] You should use Gemma 2B with 4-bit quantization.")
            self.use_quant = True
            model_id = "google/gemma-2b-it"

        elif gpu_memory_in_gb < 8.1:
            print(f"[LLM] {gpu_memory_in_gb}GB - Recommended: Gemma 2B in 4-bit.")
            self.use_quant = True
            model_id = "google/gemma-2b-it"

        elif gpu_memory_in_gb < 19.0:
            print(f"[LLM] {gpu_memory_in_gb}GB - Gemma 2B fp16 or Gemma 7B in 4-bit.")
            self.use_quant = False
            model_id = "google/gemma-2b-it"

        else:
            print(f"[LLM] {gpu_memory_in_gb}GB - Gemma 7B (fp16 or 4-bit).")
            self.use_quant = False
            model_id = "google/gemma-7b-it"

        print(f"[LLM] use_quantization = {self.use_quant}")
        print(f"[LLM] Loading model: {model_id}\n")

        # Flash Attention 2 support
        if (is_flash_attn_2_available()) and (torch.cuda.get_device_capability(0)[0] >= 8):
            attn_impl = "flash_attention_2"
            print("[LLM] Using Flash Attention 2")
        else:
            attn_impl = "sdpa"
            print("[LLM] Using Scaled Dot-Product Attention")

        # Tokenizer
        print("[LLM] Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

        # Optional 4-bit quantization
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        ) if self.use_quant else None

        # Model
        print("[LLM] Loading model weights (this may take time)...")
        self.model = AutoModelForCausalLM.from_pretrained(
            pretrained_model_name_or_path=model_id,
            torch_dtype=torch.float16,
            quantization_config=quant_config,
            low_cpu_mem_usage=False,
            attn_implementation=attn_impl
        )

        # If not quantized -> move to GPU manually
        if not self.use_quant:
            self.model.to("cuda")

        print("[LLM] Model is ready.\n")

    def generate(self, prompt, max_new_tokens=256, temperature=0.2):
        """
        Generate text from the LLM using a clean prompt.

        Decoding is greedy (do_sample=False), so the output is deterministic
        and `temperature` has no effect; it is kept only so existing callers
        that pass it keep working.
        """
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")

        output_ids = self.model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding: deterministic for RAG
        )

        # Decode only the newly generated tokens so the prompt is never echoed,
        # even if the decoded prompt differs slightly from the input string.
        prompt_len = inputs["input_ids"].shape[1]
        output_text = self.tokenizer.decode(
            output_ids[0][prompt_len:], skip_special_tokens=True
        )

        return output_text.strip()

Overwriting rag_project/llm.py
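
The memory thresholds live inside `LLM.__init__`, so they cannot be checked without a GPU. One option, sketched here, is to express the same decision as a pure function. `choose_model` is a hypothetical helper that is not part of `llm.py`; it mirrors the thresholds above (the first two branches select the same model and setting, so they are merged).

```python
def choose_model(gpu_memory_in_gb: float):
    """Return (model_id, use_quant) using the same thresholds as LLM.__init__."""
    if gpu_memory_in_gb < 8.1:
        # Small GPUs: Gemma 2B with 4-bit quantization.
        return "google/gemma-2b-it", True
    if gpu_memory_in_gb < 19.0:
        # Mid-size GPUs: Gemma 2B in fp16.
        return "google/gemma-2b-it", False
    # Large GPUs: Gemma 7B in fp16.
    return "google/gemma-7b-it", False


for gb in (4, 8, 16, 24):
    print(gb, choose_model(gb))
```

A function like this can be unit-tested on a CPU-only machine before the model weights are ever downloaded.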


###Augmented Generation

In [12]:
%%writefile rag_project/rag_engine.py
"""
rag_engine.py

The unified RAG pipeline:
- Takes a user query
- Retrieves relevant chunks (FAISS + optional reranking)
- Builds a grounded prompt
- Generates an answer using the LLM
- Returns both answer and retrieved chunks (for debugging)
"""

from retrieval import Retriever
from prompt_builder import PromptBuilder
from llm import LLM


class RAGEngine:
    def __init__(self, retriever: Retriever, llm: LLM):
        """
        retriever: Retriever object (FAISS + optional reranker)
        llm: LLM object (Gemma 2B/7B wrapper)
        """
        self.retriever = retriever
        self.llm = llm
        self.prompt_builder = PromptBuilder()

        print("[RAGEngine] Initialized RAG engine")

    def answer(self, query: str, return_chunks: bool = False):
        """
        Runs the full RAG pipeline:

        1. Retrieve relevant chunks
        2. Build prompt
        3. Generate answer
        4. Optionally return chunks for evaluation/debugging
        """
        print(f"\n[RAGEngine] Query: {query}")

        # 1. Retrieve top chunks
        print("[RAGEngine] Retrieving chunks...")
        retrieved_chunks = self.retriever.retrieve(query)

        # 2. Build LLM prompt
        print("[RAGEngine] Building prompt...")
        prompt = self.prompt_builder.build_prompt(query, retrieved_chunks)

        # 3. Get LLM answer
        print("[RAGEngine] Generating answer...")
        answer = self.llm.generate(prompt)

        if return_chunks:
            return {
                "answer": answer,
                "chunks": retrieved_chunks,
                "prompt": prompt
            }
        else:
            return answer

Overwriting rag_project/rag_engine.py


###Orchestrate the Model

In [13]:
from huggingface_hub import login

login(token="Your HF access token here")

In [14]:
astronomy_questions = [
    "What physical processes determine how long a star remains on the main sequence?",
    "Why do massive stars evolve more quickly than low-mass stars despite having more fuel?",
    "What roles do convection and radiation play inside stars, and how do you identify which zone dominates?",
    "How do nuclear reaction rates influence the internal structure of a star?",
    "What physical conditions lead a star to end its life as a white dwarf, neutron star, or black hole?",
    "Why is the spectrum of a star not a perfect blackbody?",
    "What determines the width and shape of spectral lines in stellar spectra?",
    "In what situations does the assumption of local thermodynamic equilibrium break down in a stellar atmosphere?",
    "How does the opacity of a star's atmosphere influence the emergent spectrum?",
    "What information about a star can be inferred from its spectral classification?",
    "What evidence in spiral galaxies suggests the presence of dark matter?",
    "Why do elliptical galaxies contain little cold gas compared to spiral galaxies?",
    "How do astronomers infer the orbital motions of stars inside a galaxy?",
    "What physical processes influence the shape and structure of galaxies over time?",
    "How do galaxy interactions and mergers affect galactic evolution?",
    "What observations support the idea that the universe is expanding?",
    "Why is the cosmic microwave background considered strong evidence for the early hot universe?",
    "How do astronomers measure distances to very distant galaxies?",
    "What distinguishes dark matter from dark energy in terms of their observable effects on the universe?",
    "What role do galaxy clusters play in understanding large-scale structure?",
    "What physical conditions lead to the formation of an accretion disk around a compact object?",
    "Why are some black holes strong sources of X-rays?",
    "How do astronomers determine whether an observed compact object is likely a neutron star or a black hole?",
    "What processes can accelerate particles to relativistic speeds in astrophysical environments?",
    "How do supernova explosions influence their surrounding interstellar medium?",
    "How do transiting exoplanets produce measurable changes in starlight?",
    "What factors determine whether a planet can retain an atmosphere?",
    "How do planetary migration theories explain the presence of hot Jupiters?",
    "Why are some planetary systems so different from our solar system?",
    "What methods allow astronomers to study the atmospheres of exoplanets?",
    "What factors limit the sensitivity of a ground-based telescope?",
    "How do astronomers distinguish between signal and noise in an astronomical observation?",
    "Why are space telescopes necessary for certain wavelengths of light?",
    "What determines the resolving power of an astronomical instrument (qualitatively)?",
    "How do different types of detectors differ in how they measure incoming light?"
]

query_list = astronomy_questions


!wget -O astronomy.pdf https://www.as.utexas.edu/~elr/Astronomy-LR.pdf

--2026-02-06 14:44:56--  https://www.as.utexas.edu/~elr/Astronomy-LR.pdf
Resolving www.as.utexas.edu (www.as.utexas.edu)... 128.83.20.6
Connecting to www.as.utexas.edu (www.as.utexas.edu)|128.83.20.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41720608 (40M) [application/pdf]
Saving to: ‘astronomy.pdf’


2026-02-06 14:44:57 (54.0 MB/s) - ‘astronomy.pdf’ saved [41720608/41720608]



In [15]:
import sys
import random
import torch

sys.path.append("/content/rag_project")

In [None]:
from ingestion import open_and_read_pdf
from chunking import load_spacy, chunk_page_text
from embedder import Embedder
from vector_db import FAISSVectorDB
import numpy as np

# 1. Load PDF
print(f"[INFO] Loading PDF...")
pages = open_and_read_pdf("astronomy.pdf")

# 2. Chunk all pages
print(f"[INFO] Chunking pages...")
nlp = load_spacy()
all_chunks = []
for p in pages:
    all_chunks.extend(chunk_page_text(nlp, p))

# 3. Embed all chunks
print(f"[INFO] Embedding chunks...")
emb = Embedder(device="cuda")
texts = [c["text"] for c in all_chunks]
embeddings = emb.embed_texts(texts, batch_size=32)

# 4. Build FAISS DB
print(f"[INFO] Building FAISS DB...")
db = FAISSVectorDB(dim=embeddings.shape[1])
db.add_embeddings(embeddings, all_chunks)

# 5. Save
print(f"[INFO] Saving FAISS DB...")
db.save("faiss_index.faiss", "metadata.npy")

In [None]:
from vector_db import FAISSVectorDB
from embedder import Embedder
from rerank import Reranker
from retrieval import Retriever
from prompt_builder import PromptBuilder
from llm import LLM
from rag_engine import RAGEngine
import numpy as np

# 1. Load vector DB
db = FAISSVectorDB(dim=768)
db.load("faiss_index.faiss", "metadata.npy")

# 2. Load embedder + reranker + llm
emb = Embedder(device="cuda")
reranker = Reranker(device="cuda")
retriever = Retriever(embedder=emb, vector_db=db, reranker=reranker)
llm = LLM()

# 3. Build RAG engine
engine = RAGEngine(retriever, llm)

In [22]:
import textwrap

while True:
    q = input("\nAsk a question (or type 'exit' to quit): ").strip()

    if q.lower() in ("exit", "quit", ""):
        print("\nExiting RAG system. Goodbye!")
        break

    result = engine.answer(q, return_chunks=True)

    answer = result["answer"]
    chunks = result["chunks"]

    print("\n" + "=" * 60)
    print("🟦 FINAL ANSWER")
    print("=" * 60)
    print(textwrap.fill(answer, width=80))   # wrap answer text

    print("\n" + "=" * 60)
    print("🟩 TOP RETRIEVED CHUNKS")
    print("=" * 60)

    for i, chunk in enumerate(chunks, start=1):

        # Pick rerank_score if available, otherwise FAISS score
        score = chunk.get("rerank_score", chunk.get("score", None))
        score_str = f"{score:.4f}" if score is not None else "N/A"

        text_preview = chunk["metadata"]["text"].strip()

        # Optionally shorten preview (keep first 600 chars)
        if len(text_preview) > 600:
            text_preview = text_preview[:600] + "..."

        print(f"\n[{i}]  Score = {score_str}")
        print("-" * 60)

        # Wrap chunk text to avoid going off-screen
        print(textwrap.fill(text_preview, width=80))

    print("\n" + "=" * 60)


Ask a question (or type 'exit' to quit): Why is Sun yellow in color?

[RAGEngine] Query: Why is Sun yellow in color?
[RAGEngine] Retrieving chunks...
[RAGEngine] Building prompt...
[RAGEngine] Generating answer...

🟦 FINAL ANSWER
The Sun looks yellow in color as seen from Earth’s surface because the nitrogen
molecules in our planet’s atmosphere scatter some of the shorter (i.e., blue)
wavelengths of light out of the beams of sunlight that reach us, leaving more
long wavelength light behind.

🟩 TOP RETRIEVED CHUNKS

[1]  Score = 0.6960
------------------------------------------------------------
Example Star Colors and Corresponding Approximate Temperatures Star Color
Approximate Temperature Example Orange 4000 K Aldebaran Red 3000 K Betelgeuse
Table 17.1 The hottest stars have temperatures of over 40,000 K, and the coolest
stars have temperatures of about 2000 K. Our Sun’s surface temperature is about
6000 K; its peak wavelength color is a slightly greenish-yellow. In spac