Perfect 👌 now that you have a working **LLM module + prompting techniques**, the next big step is **RAG (Retrieval-Augmented Generation)**.

Let’s structure this step by step:

---

## 🔹 What is RAG?

**Retrieval-Augmented Generation (RAG)** is a pipeline where:

1. You **store documents** in a vector database (or embeddings store).
2. At query time, you **retrieve relevant chunks** of text based on semantic similarity.
3. You **feed those retrieved chunks + user query** into your LLM to get an enriched answer.

This reduces hallucinations (answers are grounded in retrieved text rather than the model's memory) and lets your model answer **domain-specific questions** it was never trained on.

---

## 🔹 Minimal RAG Pipeline Components

1. **Document Loader** → Load your text, PDFs, or dataset.
2. **Text Splitter** → Break text into chunks (small, focused chunks embed more precisely and fit the LLM's context window).
3. **Embedding Model** → Convert text chunks into dense vectors.
4. **Vector Store** → Save and query embeddings (e.g., FAISS, Chroma).
5. **Retriever** → Find top-k similar chunks for a query.
6. **LLM Integration** → Combine retrieved chunks with user’s question, then send to your `LLMClient`.

---


The **Text Splitter** step can use any of the following chunking strategies.

## 🔹 1. **Fixed-Size Chunking**

* **Description**: Split text into chunks of fixed size (e.g., 500 tokens or 1000 characters).
* **Pros**: Simple, fast, works for most use cases.
* **Cons**: Can cut sentences in half, losing semantic meaning.
* **Use case**: When text is uniform and doesn’t need sentence alignment (e.g., Wikipedia, transcripts).
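
A minimal sketch of fixed-size chunking by character count. The function name `chunk_fixed` and the tiny `size` are illustrative assumptions; a token-based variant would count tokens from your tokenizer instead of characters.

```python
def chunk_fixed(text: str, size: int = 1000) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


print(chunk_fixed("abcdefghij", size=4))  # ['abcd', 'efgh', 'ij']
```

Note how the last chunk is shorter, and how a boundary can fall anywhere, including mid-sentence.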

---

## 🔹 2. **Sliding Window (Overlap) Chunking**

* **Description**: Like fixed-size, but with overlap (e.g., 500 tokens with 100-token overlap).
* **Pros**: Prevents loss of context at chunk boundaries.
* **Cons**: More storage, redundant embeddings.
* **Use case**: When answers may span across chunk boundaries (legal documents, long narratives).
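
The overlap idea can be sketched as a window that advances by `size - overlap` items each step. This is an illustrative helper (`chunk_sliding` is a made-up name) that works on a list of words; swap in real tokens if you need token-accurate sizes.

```python
def chunk_sliding(words: list[str], size: int = 500, overlap: int = 100) -> list[list[str]]:
    """Windows of `size` words where consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


for window in chunk_sliding(list("abcdefgh"), size=4, overlap=2):
    print("".join(window))
# abcd
# cdef
# efgh
```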

---

## 🔹 3. **Sentence-Based Chunking**

* **Description**: Split by sentences using NLP tools (spaCy, NLTK, etc.).
* **Pros**: Keeps meaning intact, avoids cutting in the middle.
* **Cons**: Some sentences can be too short/too long.
* **Use case**: QA systems, chatbots where sentence coherence is key.

---

## 🔹 4. **Semantic Chunking**

* **Description**: Use embeddings to split text where topic/semantic shifts occur.
* **Pros**: Preserves logical flow, more meaningful retrieval.
* **Cons**: Expensive (needs embedding during chunking).
* **Use case**: Research papers, mixed-topic articles.
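
A rough sketch of the idea, assuming a toy `embed` function (bag-of-words counts) in place of a real embedding model and an arbitrary `threshold` of 0.2. A new chunk starts whenever two adjacent sentences are less similar than the threshold.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk when similarity to the previous sentence drops below `threshold`."""
    chunks: list[list[str]] = []
    for s in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(s)) >= threshold:
            chunks[-1].append(s)
        else:
            chunks.append([s])
    return chunks


sentences = [
    "Paracetamol relieves pain.",
    "Paracetamol also reduces fever.",
    "FAISS stores vectors.",
    "FAISS searches vectors quickly.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
# ['Paracetamol relieves pain.', 'Paracetamol also reduces fever.']
# ['FAISS stores vectors.', 'FAISS searches vectors quickly.']
```

With a real model, `embed` would return dense vectors and the threshold would need tuning on your own documents.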

---

## 🔹 5. **Recursive Text Splitting (LangChain style)**

* **Description**: Try to split by paragraph → if too long, split by sentence → if still too long, split by fixed-size.
* **Pros**: Adaptive, balances context and size.
* **Cons**: More complex; separators and size limits need tuning per corpus (LangChain ships it as `RecursiveCharacterTextSplitter`).
* **Use case**: Generic, production-ready RAG pipelines.
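
The paragraph → sentence → fixed-size fallback can be sketched in plain Python. This is a simplified stand-in, not LangChain's implementation: `recursive_split`, the separator list, and the tiny `max_len` are assumptions for illustration.

```python
def recursive_split(text: str, max_len: int = 200,
                    seps: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not seps:
        # No separators left: fall back to fixed-size slices.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = seps[0], seps[1:]
    chunks: list[str] = []
    buf = ""
    for piece in text.split(sep):
        candidate = f"{buf}{sep}{piece}" if buf else piece
        if len(candidate) <= max_len:
            buf = candidate  # keep packing pieces into the current chunk
            continue
        if buf.strip():
            chunks.append(buf)
        if len(piece) <= max_len:
            buf = piece
        else:
            chunks.extend(recursive_split(piece, max_len, finer))
            buf = ""
    if buf.strip():
        chunks.append(buf)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", max_len=8))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits as a whole; only the second, oversized paragraph is broken down further at word boundaries.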

---

## 🔹 6. **Section / Structural Chunking**

* **Description**: Split based on document structure (headings, sections, chapters).
* **Pros**: Maintains context within logical units.
* **Cons**: Requires structured docs (Markdown, PDF with headings).
* **Use case**: Knowledge bases, manuals, legal docs.
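
For Markdown sources, a small sketch of structural chunking is to cut at heading lines. `split_by_headings` is a hypothetical helper; it ignores edge cases such as `#` lines inside code fences.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)")


def split_by_headings(markdown: str) -> list[dict]:
    """Return one {'heading', 'text'} record per Markdown section."""
    sections: list[dict] = []
    heading, body = "", []

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading or text:
            sections.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            heading, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return sections


doc = "# Install\nRun the installer.\n\n## Usage\nStart the service."
for section in split_by_headings(doc):
    print(section)
# {'heading': 'Install', 'text': 'Run the installer.'}
# {'heading': 'Usage', 'text': 'Start the service.'}
```

Keeping the heading alongside the text lets you store it as metadata or prepend it to the chunk before embedding.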

---

## 🔹 7. **Dialogue / Turn-Based Chunking**

* **Description**: For chat logs, split by speaker turns.
* **Pros**: Preserves conversational context.
* **Cons**: Some turns may be too short.
* **Use case**: Customer support RAG, call center analysis.
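
A sketch of turn-based chunking, assuming each turn starts with a `Speaker:` label (an assumption about the log format). Grouping a few turns per chunk is one way to soften the "turns too short" problem; `turns_per_chunk` is an illustrative parameter.

```python
import re

TURN = re.compile(r"^\w[\w ]*:")  # e.g. "Agent:" or "Customer 2:"


def chunk_dialogue(lines: list[str], turns_per_chunk: int = 2) -> list[str]:
    """Merge continuation lines into their turn, then group turns into chunks."""
    turns: list[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TURN.match(line) or not turns:
            turns.append(line)
        else:
            turns[-1] += " " + line  # unlabeled line continues the previous turn
    return ["\n".join(turns[i:i + turns_per_chunk])
            for i in range(0, len(turns), turns_per_chunk)]


log = [
    "Customer: My order is late.",
    "Agent: Let me check.",
    "It shipped yesterday.",
    "Customer: Thanks.",
]
for chunk in chunk_dialogue(log):
    print(repr(chunk))
# 'Customer: My order is late.\nAgent: Let me check. It shipped yesterday.'
# 'Customer: Thanks.'
```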

---


# 🔹 What are Embeddings?

Embeddings are **numerical vector representations of text (or other data like images, audio, code, etc.)** in a high-dimensional space.

* Words, sentences, or documents are mapped to vectors.
* Similar meaning → vectors close together.
* Different meaning → vectors far apart.

👉 This is what makes **semantic search** in RAG possible. Instead of keyword search, embeddings allow "meaning-based" search.
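
To make "close together" concrete, here is a small sketch using cosine similarity. The 3-d vectors are hand-made for illustration only; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up vectors: the first two texts point in a similar direction.
vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "money-back guarantee": [0.8, 0.2, 0.1],
    "hiking trails": [0.0, 0.2, 0.9],
}

query = vectors["refund policy"]
ranked = sorted(vectors, key=lambda name: cosine_similarity(query, vectors[name]), reverse=True)
print(ranked)  # ['refund policy', 'money-back guarantee', 'hiking trails']
```

"money-back guarantee" shares no keyword with "refund policy", yet it ranks far above "hiking trails" because its vector points in almost the same direction.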

---

# 🔹 Types of Embeddings (as per RAG)

In RAG pipelines, you’ll generally encounter these embedding approaches:

### 1. **Word Embeddings**

* Classic ones like **Word2Vec, GloVe, FastText**.
* Represent words only (no sentence-level context).
* **Limitations**: Same vector for "bank" (river bank vs money bank).
* ✅ Good for small/simple RAG but outdated.

---

### 2. **Sentence / Document Embeddings**

* Models: **Sentence-BERT (SBERT), Universal Sentence Encoder**.
* Encode **whole sentences or paragraphs** into a single vector.
* ✅ Most common in RAG because queries & chunks align well.

---

### 3. **Contextual Embeddings (Transformer-based)**

* From **BERT, RoBERTa, DistilBERT, DeBERTa**.
* Each token's vector depends on its surrounding context.
* You can pool them (mean/max pooling) → single vector for sentence.
* ✅ Good accuracy, moderate size.
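
Mean pooling can be sketched in plain Python: average the token vectors while skipping padding positions. The 2-d token vectors below are invented; with a real transformer they would be the last hidden states and the mask would be the tokenizer's attention mask.

```python
def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```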

---

### 4. **Instruction-Tuned Embeddings**

* Models: **OpenAI `text-embedding-ada-002`, `text-embedding-3-large/small`**, **E5, Instructor-XL**.
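* Trained specifically for **retrieval tasks**; E5 and Instructor additionally take a prefix or instruction (e.g., `query:` / `passage:` for E5) that distinguishes queries from documents.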
* ✅ Best choice for production-grade RAG.

---

### 5. **Multimodal Embeddings**

* For **text + image + video**.
* Models: **CLIP, BLIP, ALIGN**.
* Example: Search “a dog with sunglasses” and retrieve the correct image.
* ✅ Useful if your RAG involves **image search or video docs**.

---

# 🔹 Embedding Models (popular ones for RAG)

| Model                      | Provider     | Dimensions                | Speed  | Best Use                    |
| -------------------------- | ------------ | ------------------------- | ------ | --------------------------- |
| **text-embedding-ada-002** | OpenAI       | 1536                      | Fast   | General RAG                 |
| **text-embedding-3-large** | OpenAI       | 3072                      | Slower | High-accuracy RAG           |
| **E5-base / E5-large**     | Hugging Face | 768 / 1024                | Good   | Open-source RAG             |
| **Instructor-XL**          | Hugging Face | 768                       | Good   | Query-aware RAG             |
| **Sentence-BERT**          | Hugging Face | 384–768 (varies by model) | Fast   | Lightweight semantic search |
| **CLIP**                   | OpenAI       | 512–1024                  | Good   | Image+text retrieval        |

---

# 🔹 Parameters to Know in Embedding Models

1. **Dimension size** → higher = more expressive, but storage and search cost ↑

   * e.g., `all-MiniLM-L6-v2` = 384-d, SBERT base models = 768-d, OpenAI ada = 1536-d.
2. **Cosine similarity / Dot product** → used for comparing embeddings (the two give the same ranking once vectors are L2-normalized).
3. **Speed vs Accuracy tradeoff** → larger models (Instructor, E5-large) are slower but usually more accurate.
4. **Specialization** → general (OpenAI) vs instruction-tuned (Instructor).
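
Two of these parameters are easy to check with a few lines of arithmetic. The sketch below assumes uncompressed float32 storage (4 bytes per value, no index overhead or metadata) and uses made-up 2-d vectors for the similarity check.

```python
import math


def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a flat float32 index: no compression, no metadata."""
    return num_vectors * dim * bytes_per_value


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Storage for one million chunks at two common dimension sizes (in GB)
print(index_size_bytes(1_000_000, 384) / 1e9)   # 1.536
print(index_size_bytes(1_000_000, 1536) / 1e9)  # 6.144

# After L2 normalization, the dot product equals the cosine similarity
a, b = [3.0, 4.0], [4.0, 3.0]
print(math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b)))  # True
```

This is why an inner-product index over normalized vectors behaves like a cosine-similarity index.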

---

# 🔹 Embeddings in RAG Flow

1. Chunk text (Fixed / Recursive).
2. Encode each chunk → embedding.
3. Store in **vector DB** (FAISS, Qdrant, Pinecone, Weaviate).
4. Encode query → embedding.
5. Find **nearest vectors** (semantic match).
6. Pass retrieved docs → LLM for answer generation.


# 🔹 What is a Vector Database?

A **vector database** stores high-dimensional vectors (embeddings) and provides **fast similarity search** (e.g., “which vectors are closest to my query?”).

* Instead of SQL queries → we use **nearest neighbor search** (cosine similarity, dot product, L2 distance).
* Example: You search “What is quantum computing?” → query gets embedded → vector DB finds most similar doc embeddings → send them to LLM for answering.

---

# 🔹 Core Features of Vector DBs

1. **Vector storage** → Store embeddings from documents, images, audio, etc.
2. **Indexing** → Organize vectors for **fast search** (e.g., HNSW, IVF, PQ).
3. **Similarity search** → Cosine, dot product, Euclidean distance.
4. **Metadata filtering** → e.g., only docs from `2023` or `category=finance`.
5. **Scalability** → Handle millions/billions of vectors.
6. **Integrations** → APIs for Python, REST, gRPC, LangChain, LlamaIndex.

---

# 🔹 Types of Vector Databases

### 1. **Lightweight Local Indexes (Library-based)**

* Examples: **FAISS (Meta), Annoy (Spotify), HNSWlib**.
* Run locally (in-memory or disk).
* Very fast for **small to medium datasets**.
* ✅ Best for prototyping, small RAG apps.
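* ❌ No database features out of the box: persistence is manual (you save/load index files yourself), and there is no metadata store or distributed scaling unless wrapped.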

---

### 2. **Cloud-Native Managed Vector DBs**

* Examples: **Pinecone, Weaviate Cloud, Qdrant Cloud, Zilliz Cloud (managed Milvus)**.
* Fully managed: auto-scaling, persistence, backups.
* Built for **production RAG apps**.
* ✅ Easy to integrate, powerful APIs.
* ❌ Costly for large-scale usage.

---

### 3. **Self-Hosted Vector DBs**

* Examples: **Weaviate, Milvus, Qdrant (open source)**.
* Run on your own server/cloud.
* ✅ Scalable & production-ready with control.
* ❌ Setup/maintenance required.

---

### 4. **Hybrid Search Engines (Text + Vector)**

* Examples: **Elasticsearch (native kNN vector search), Vespa, Redis Vector Search, PostgreSQL pgvector**.
* Can handle **keyword + semantic search together**.
* ✅ Useful when you need **hybrid RAG** (combine BM25 + embeddings).
* ❌ Often slower than purpose-built vector DBs at very large scale.
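
One simple way to combine BM25 and embedding results is a weighted sum of normalized scores. The sketch below is an assumption about how you might do it yourself (engines differ in their built-in fusion); the scores and the weight `alpha` are made up.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so keyword and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(keyword: dict[str, float], vector: dict[str, float], alpha: float = 0.5) -> dict[str, float]:
    """alpha weights the vector score; (1 - alpha) weights the keyword score."""
    kw, vec = min_max(keyword), min_max(vector)
    return {
        doc: alpha * vec.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }


keyword = {"doc1": 12.0, "doc2": 3.0, "doc3": 0.0}  # BM25-style scores
vector = {"doc1": 0.2, "doc2": 0.9, "doc3": 0.4}    # cosine similarities

scores = hybrid_scores(keyword, vector)
best = max(scores, key=scores.get)
print(best)  # doc2
```

"doc1" wins on keywords and "doc2" wins on meaning; with equal weights, "doc2" comes out on top because it also has some keyword support.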

---

# 🔹 Popular Vector DBs and Use Cases

| Vector DB                              | Type                | Best Use Case                                |
| -------------------------------------- | ------------------- | -------------------------------------------- |
| **FAISS** (Meta)                       | Local library       | Prototyping, personal projects, academic RAG |
| **Annoy** (Spotify)                    | Local library       | Music similarity, recommendation engines     |
| **HNSWlib**                            | Local library       | Fast nearest neighbor, lightweight apps      |
| **Pinecone**                           | Managed cloud       | Large-scale enterprise RAG                   |
| **Qdrant**                             | Self-hosted / Cloud | Open-source, GPU-accelerated, scalable       |
| **Weaviate**                           | Self-hosted / Cloud | Hybrid search, multimodal (text+image)       |
| **Milvus**                             | Self-hosted / Cloud | Industry-scale big vector storage            |
| **Elasticsearch (with vector search)** | Hybrid search       | Enterprise apps mixing keyword + semantic    |
| **Redis Vector Search**                | In-memory + hybrid  | Real-time recommendations                    |
| **Postgres pgvector**                  | SQL + Vector        | When you want SQL + embeddings in one DB     |

---

# 🔹 Use Cases of Vector DBs

### 📚 Knowledge Retrieval (RAG)

* Store embeddings of documents (PDFs, websites, manuals).
* Query embeddings → retrieve relevant docs → answer with LLM.

### 🎵 Recommendations

* Store embeddings of songs → recommend similar tracks (Spotify style).

### 🛍️ E-commerce Search

* Store product embeddings → semantic search for “comfortable red running shoes”.

### 🎥 Multimedia Search

* Image/video embeddings (CLIP) → “find all images with a dog and a ball”.

### 👨‍💻 Personalized Assistants

* Store conversation history embeddings → retrieve context for chatbots.

### 🧠 Anomaly Detection

* Store user behavior embeddings → flag abnormal activity.

---

# 🔹 Key Differences in DB Choice

* Small project → **FAISS** (fast, local).
* Production with scaling → **Pinecone / Qdrant / Weaviate**.
* SQL + vector together → **Postgres pgvector**.
* Real-time apps → **Redis Vector Search**.

```python
TOP_K = 3
PDF_PATH = r"C:\Users\SurajPatra\Desktop\FASTAPI\GENAI_LIB\rag_tools\paracetamol.pdf"
EMBED_MODEL_NAME = "all-MiniLM-L6-v2"
FAISS_INDEX_PATH = "data/paracetamol_faiss.index"
CHUNKS_JSON = "data/paracetamol_chunks.json"
```


```python
# Standard library
import json
import logging
import os
import sys
import uuid
from typing import List, Tuple

# Vector index and PDF parsing
import faiss
import PyPDF2

# Embeddings
from sentence_transformers import SentenceTransformer
import numpy as np

# HTTP requests (for direct Ollama fallback)
import requests
```


```python
# Optionally import your LLMClient if packaged
try:
    from llm_tools.llm_tools import LLMConfig, LLMClient, Provider
    HAS_LLM_CLIENT = True
except Exception:
    HAS_LLM_CLIENT = False
```

```python
# Ollama config (fallback direct HTTP)
OLLAMA_BASE = os.environ.get("OLLAMA_BASE", "http://localhost:11434")
OLLAMA_MODEL = "mistral:latest"

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag-demo")
```

```python
def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from all pages of a PDF using PyPDF2."""
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    text_parts = []
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        for i, page in enumerate(reader.pages):
            try:
                txt = page.extract_text() or ""
            except Exception as e:
                logger.warning("Failed to extract page %s: %s", i, e)
                txt = ""
            text_parts.append(txt)
    return "\n".join(text_parts)
```

```python
def sentence_tokenize(text: str) -> List[str]:
    """Very simple sentence splitter using punctuation. Good enough for demo."""
    import re
    # split on whitespace that follows ., ?, or ! (rough: also splits after abbreviations like "e.g.")
    sentences = re.split(r'(?<=[.?!])\s+', text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences
```

```python
def chunk_sentences(sentences: List[str], max_tokens: int = 120, overlap: int = 20) -> List[str]:
    """
    Build chunks by concatenating sentences until we reach approx max_tokens (estimated by words),
    and carry roughly `overlap` words of trailing sentences into the next chunk.
    This is a simple heuristic chunker appropriate for short docs.
    """
    chunks = []
    cur = []
    cur_words = 0
    i = 0
    while i < len(sentences):
        s = sentences[i]
        s_words = len(s.split())
        # add the sentence if it fits, or if the chunk is empty (oversized sentence)
        if cur_words + s_words <= max_tokens or not cur:
            cur.append(s)
            cur_words += s_words
            i += 1
            continue
        chunks.append(" ".join(cur))
        # overlap: keep the last few sentences (about `overlap` words) for the next chunk
        keep = []
        keep_words = 0
        while overlap > 0 and cur and keep_words < overlap:
            sent = cur.pop()  # remove last sentence
            keep.insert(0, sent)
            keep_words += len(sent.split())
        # If the carried-over sentences leave no room for the next sentence, drop them;
        # otherwise the loop would emit the same chunk forever.
        if keep_words + s_words > max_tokens:
            keep, keep_words = [], 0
        cur, cur_words = keep, keep_words
    # flush
    if cur:
        chunks.append(" ".join(cur))
    return chunks
```

```python
class SimpleFaissStore:
    def __init__(self, dim: int):
        self.dim = dim
        self.index = faiss.IndexFlatIP(dim)  # use inner product; we'll normalize embeddings for cosine
        self.id_to_meta = {}  # map index position -> metadata
        self.ntotal = 0

    def add(self, embeddings: np.ndarray, metas: List[dict]):
        """
        embeddings: numpy array (N, dim)
        metas: list of metadata dicts length N
        """
        assert embeddings.shape[1] == self.dim
        # FAISS needs contiguous float32; copying also leaves the caller's array untouched
        embeddings = np.array(embeddings, dtype="float32", order="C")
        # normalize for cosine similarity with inner product
        faiss.normalize_L2(embeddings)
        self.index.add(embeddings)
        base = self.ntotal
        for i, m in enumerate(metas):
            self.id_to_meta[base + i] = m
        self.ntotal += embeddings.shape[0]
        logger.info("Added %d vectors, total now %d", embeddings.shape[0], self.ntotal)

    def search(self, query_vec: np.ndarray, k: int = 5) -> List[Tuple[float, dict]]:
        # query_vec: (dim,) or (1,dim)
        v = query_vec.reshape(1, -1).astype("float32")
        faiss.normalize_L2(v)
        D, I = self.index.search(v, k)
        results = []
        for score, idx in zip(D[0], I[0]):
            if idx < 0:  # FAISS pads with -1 when fewer than k vectors exist
                continue
            meta = self.id_to_meta.get(int(idx), {})
            results.append((float(score), meta))
        return results

    def save(self, index_path: str, meta_path: str):
        # create the target folders (e.g. data/) if they do not exist yet
        os.makedirs(os.path.dirname(index_path) or ".", exist_ok=True)
        os.makedirs(os.path.dirname(meta_path) or ".", exist_ok=True)
        faiss.write_index(self.index, index_path)
        with open(meta_path, "w", encoding="utf8") as f:
            json.dump(self.id_to_meta, f, ensure_ascii=False, indent=2)
        logger.info("Saved index to %s and metadata to %s", index_path, meta_path)

    @classmethod
    def load(cls, index_path: str, meta_path: str):
        if not os.path.isfile(index_path) or not os.path.isfile(meta_path):
            raise FileNotFoundError("Index or metadata not found.")
        index = faiss.read_index(index_path)
        with open(meta_path, "r", encoding="utf8") as f:
            id_to_meta = json.load(f)
        dim = index.d
        store = cls(dim)
        store.index = index
        # JSON turns integer keys into strings; convert them back
        store.id_to_meta = {int(k): v for k, v in id_to_meta.items()}
        store.ntotal = index.ntotal
        logger.info("Loaded index with dim=%d total=%d", dim, store.ntotal)
        return store
```

```python
# ---------- Ollama wrapper (if no LLMClient) ----------

def ask_ollama_direct(prompt: str, model: str = OLLAMA_MODEL, base_url: str = OLLAMA_BASE, timeout: int = 60) -> str:
    """
    Direct Ollama HTTP call to /api/chat (non-streaming).
    Expects Ollama running locally at base_url.
    """
    try:
        url = f"{base_url}/api/chat"
        payload = {
            "model": model,
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            "stream": False,
            "options": {"temperature": 0.2}
        }
        resp = requests.post(url, json=payload, timeout=timeout)
        resp.raise_for_status()
        data = resp.json()
        return data.get("message", {}).get("content", "")
    except Exception as e:
        logger.error("Ollama direct call failed: %s", e)
        return f"[Ollama error] {e}"
```

```python
# ---------- Main RAG flow ----------

def build_index_from_pdf(pdf_path: str,
                         embedder: SentenceTransformer,
                         chunk_size_words: int = 120,
                         overlap_words: int = 20,
                         save_index: bool = True) -> SimpleFaissStore:
    """Extract, chunk, embed and build FAISS index. Returns SimpleFaissStore."""
    logger.info("Extracting text from PDF: %s", pdf_path)
    text = extract_text_from_pdf(pdf_path)
    if not text.strip():
        raise ValueError("No text extracted from PDF.")

    sentences = sentence_tokenize(text)
    logger.info("Extracted %d sentences from PDF", len(sentences))

    chunks = chunk_sentences(sentences, max_tokens=chunk_size_words, overlap=overlap_words)
    logger.info("Built %d chunks", len(chunks))

    # Build metadata list with ids and source info
    metas = []
    for i, chunk in enumerate(chunks):
        metas.append({
            "id": str(uuid.uuid4()),
            "chunk_index": i,
            "text": chunk[:2000]  # cap for safety
        })

    # Compute embeddings in batches to avoid memory spikes
    batch_size = 32
    embeddings = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i+batch_size]
        emb = embedder.encode(batch, convert_to_numpy=True, show_progress_bar=False)
        embeddings.append(emb)
        logger.info("Embedded batch %d-%d", i, i + len(batch) - 1)
    embeddings = np.vstack(embeddings)
    dim = embeddings.shape[1]
    logger.info("Embeddings shape: %s", embeddings.shape)

    store = SimpleFaissStore(dim)
    store.add(embeddings, metas)

    if save_index:
        store.save(FAISS_INDEX_PATH, CHUNKS_JSON)
        logger.info("Saved index and chunk metadata for future runs.")

    return store
```

```python
def answer_query_with_rag(query: str, store: SimpleFaissStore, embedder: SentenceTransformer,
                          llm_client=None, top_k: int = TOP_K) -> str:
    """Retrieve relevant chunks and ask LLM with a context prompt."""
    # Embed query
    qvec = embedder.encode([query], convert_to_numpy=True)[0]
    results = store.search(qvec, k=top_k)
    logger.info("Retrieved %d results", len(results))

    # Build context text
    context_pieces = []
    for score, meta in results:
        txt = meta.get("text", "")
        context_pieces.append(f"(score:{score:.3f}) {txt}")
    context = "\n\n---\n\n".join(context_pieces)

    # Assemble prompt
    prompt = (
        "You are an expert assistant. Use the following extracted context from a Paracetamol leaflet "
        "to answer the question. If the context does not contain the answer, say you don't know.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION:\n{query}\n\nPlease answer concisely and cite which context chunk you used."
    )

    logger.info("Prompt length: %d chars", len(prompt))

    # Ask through provided llm_client (preferred) or direct Ollama call
    if llm_client is not None and HAS_LLM_CLIENT:
        try:
            res = llm_client.generate(user_prompt=prompt, system_prompt=None)
            return res.text
        except Exception as e:
            logger.warning("LLMClient failed: %s. Falling back to direct Ollama.", e)

    # fallback: direct Ollama HTTP call
    return ask_ollama_direct(prompt)
```

```python
# ---------- Demo runner ----------

def run_demo(pdf_path=PDF_PATH):
    # 1. load embedder
    logger.info("Loading embedding model: %s", EMBED_MODEL_NAME)
    embedder = SentenceTransformer(EMBED_MODEL_NAME)

    # 2. build or load index
    if os.path.exists(FAISS_INDEX_PATH) and os.path.exists(CHUNKS_JSON):
        logger.info("Found saved index; loading...")
        store = SimpleFaissStore.load(FAISS_INDEX_PATH, CHUNKS_JSON)
    else:
        store = build_index_from_pdf(pdf_path, embedder, chunk_size_words=120, overlap_words=20, save_index=True)

    # 3. prepare LLM client if available
    llm_client = None
    if HAS_LLM_CLIENT:
        try:
            cfg = LLMConfig(provider=Provider.OLLAMA, model=OLLAMA_MODEL, base_url=OLLAMA_BASE, temperature=0.2)
            llm_client = LLMClient(cfg)
            logger.info("LLMClient prepared (Ollama).")
        except Exception as e:
            logger.warning("Failed to init LLMClient: %s", e)
            llm_client = None

    # 4. interactive queries
    print("\nRAG demo ready. Type a question about the Paracetamol leaflet (type 'exit' to quit).")
    while True:
        q = input("\nYour question > ").strip()
        if q.lower() in ("exit", "quit"):
            break
        if not q:
            continue
        answer = answer_query_with_rag(q, store, embedder, llm_client=llm_client, top_k=TOP_K)
        print("\n=== Answer ===\n")
        print(answer)
        print("\n==============\n")
```

```python
if __name__ == "__main__":
    if not os.path.exists(PDF_PATH):
        print(f"Please put your paracetamol PDF at: {PDF_PATH} and re-run.")
        sys.exit(1)
    run_demo(PDF_PATH)
```

```text
INFO:rag-demo:Loading embedding model: all-MiniLM-L6-v2
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:rag-demo:Extracting text from PDF: C:\Users\SurajPatra\Desktop\FASTAPI\GENAI_LIB\rag_tools\paracetamol.pdf
INFO:rag-demo:Extracted 9 sentences from PDF
INFO:rag-demo:Built 3 chunks
INFO:rag-demo:Embedded batch 0-2
INFO:rag-demo:Embeddings shape: (3, 384)
INFO:rag-demo:Added 3 vectors, total now 3
INFO:rag-demo:Saved index to data/paracetamol_faiss.index and metadata to data/paracetamol_chunks.json
INFO:rag-demo:Saved index and chunk metadata for future runs.
INFO:rag-demo:LLMClient prepared (Ollama).

RAG demo ready. Type a question about the Paracetamol leaflet (type 'exit' to quit).

Batches: 100%|██████████| 1/1 [00:00<00:00, 49.81it/s]
INFO:rag-demo:Retrieved 3 results
INFO:rag-demo:Prompt length: 2136 chars

=== Answer ===

 The other name of Paracetamol is acetaminophen. (Context: Paracetamol, also known as **acetaminophen**, is one of the most widely used medicines for relieving pain and reducing fever.)
```




# 🔹 Types of RAG & Their Use Cases

### 1. **Vanilla RAG (Basic RAG)**

* **How it works:**

  * Query → Embed → Search chunks in vector DB → Send retrieved chunks + query to LLM → LLM answers.
* **Strength:** Simple, easy to implement.
* **Weakness:** Answer quality depends entirely on the top-k chunks; anything the retriever misses (or that does not fit the context window) never reaches the LLM.
* **Use Cases:** FAQ bots, small document Q&A (like medical leaflets, company manuals).

---

### 2. **RAG with Re-Ranking**

* **How it works:**

  * Retrieve top-k chunks.
  * Use a **cross-encoder re-ranker model** (like `ms-marco-MiniLM-L-12-v2`) to reorder results by semantic relevance.
  * Pass top-n (better) chunks to LLM.
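* **Strength:** Usually a more precise ranking than embedding similarity alone, because the re-ranker reads the query and the chunk together.
* **Weakness:** Extra compute cost (one model call per query-chunk pair).
* **Use Cases:** Legal/medical Q&A, enterprise knowledge search, where **precision matters**.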
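
The retrieve-then-re-rank step can be sketched with a pluggable scoring function. In a real pipeline `score` would call a cross-encoder; here `overlap_score` is a toy stand-in (word overlap) so the example runs without any model, and all names are illustrative.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 2) -> list[str]:
    """Re-score every (query, candidate) pair and keep the `top_n` best."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:top_n]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


# Pretend these came back from the vector store, in retrieval order
candidates = [
    "the index is stored on disk",
    "re-ranking reorders retrieved chunks",
    "a cross-encoder scores query and chunk together",
]

top = rerank("how does a cross-encoder score a chunk", candidates, overlap_score, top_n=1)
print(top)  # ['a cross-encoder scores query and chunk together']
```

The chunk retrieved last moves to the front once each candidate is scored against the query directly.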

---

### 3. **Multi-Vector RAG**

* **How it works:**

  * Instead of one embedding per chunk, multiple embeddings are generated (e.g., one for entities, one for summary, one for keywords).
  * Retrieval considers multiple “views” of the text.
* **Strength:** Handles complex queries (synonyms, multi-perspective search).
* **Weakness:** Larger storage, slower retrieval.
* **Use Cases:** Scientific papers, financial reports, customer support KBs.

---

### 4. **Hierarchical RAG (Tree-based / Map-Reduce RAG)**

* **How it works:**

  * Organize documents in hierarchy:

    * First retrieve relevant sections (chapter-level).
    * Then drill down (paragraph-level).
  * Or use **map-reduce summarization** (summarize chunks → merge summaries → final answer).
* **Strength:** Scales well to very large documents (books, multi-GB datasets).
* **Weakness:** Extra latency, since each query needs several retrieval or summarization steps.
* **Use Cases:** E-discovery, research literature review, corporate compliance audits.

---

### 5. **Agentic RAG**

* **How it works:**

  * LLM is treated as an **agent** with tools.
  * It decides dynamically: “Should I query the vector DB? Which DB? Should I refine query?”
  * Sometimes uses **multi-hop reasoning** (ask DB → rephrase → ask again).
* **Strength:** Very flexible, interactive.
* **Weakness:** Expensive (multiple LLM calls).
* **Use Cases:** Customer support copilots, coding assistants, interactive chat with knowledge bases.

---

### 6. **Fusion RAG (Query/Answer Fusion)**

* **How it works:**

  * Create multiple variations of the query (query expansion).
  * Retrieve results for each.
  * Fuse results before sending to LLM.
* **Strength:** Handles vague/short queries well.
* **Weakness:** Extra query variants can pull in off-topic chunks (noise) and multiply retrieval cost.
* **Use Cases:** Search engines, ecommerce product search, natural language business intelligence.
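
One common way to do the "fuse results" step is Reciprocal Rank Fusion (RRF): each document earns `1 / (k + rank)` from every ranking it appears in. The source does not name a specific fusion method, so treat RRF, the constant `k = 60`, and the document IDs below as assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; documents ranked high in many lists rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # sort by fused score, breaking ties by ID for a stable result
    return sorted(scores, key=lambda d: (-scores[d], d))


# One ranked result list per query variation
rankings = [
    ["d1", "d2", "d3"],
    ["d2", "d1", "d4"],
    ["d2", "d3", "d1"],
]

fused = reciprocal_rank_fusion(rankings)
print(fused)  # ['d2', 'd1', 'd3', 'd4']
```

RRF only needs ranks, not raw scores, so it works even when the result lists come from retrievers with incomparable scoring scales.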

---

# 🎯 Summary

* **Vanilla RAG** → Simple FAQs, small docs.
* **Re-Ranking RAG** → High precision (law, medicine).
* **Multi-Vector RAG** → Deep/complex queries.
* **Hierarchical RAG** → Long books, enterprise data.
* **Agentic RAG** → Copilots, tool-using assistants.
* **Fusion RAG** → Search engines, vague queries.
