Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---


# Information Retrieval and Text Mining Course
## Tutorial 11 — Conversational Search: Sticking to the Facts (RAG & Knowledge Graphs)

**Author:** Jan Scholtes

**Edition 2025-2026**

Department of Advanced Computer Sciences — Maastricht University

Welcome to Tutorial 11 on **Conversational Search: Sticking to the Facts**. This is the second of three tutorials on Conversational Search:

1. **The Basics** (Tutorial 10) — dialogue structure, query understanding, hybrid search, re-ranking, evaluation metrics
2. **Sticking to the Facts — RAG & Knowledge Graphs** (this tutorial) — retrieval-augmented generation, knowledge graph integration, hallucination detection
3. **Agentic Approaches** (Tutorial 12) — agents, memory, tools, multi-agent orchestration

In this tutorial we explore how to keep LLMs factually grounded using external knowledge. The topics covered are:

1. **Challenges in Conversational AI** — bias, hallucination, provenance, "stochastic parrots"
2. **What is RAG?** — the retrieval-augmented generation pipeline and its components
3. **Prompt-Level RAG (Early Fusion)** — injecting retrieved text into prompts
4. **Vector-Level RAG (Embedding Fusion)** — merging query and context embeddings
5. **Knowledge Graph Integration** — SPARQL, KG-augmented RAG, TransE, GraphSAGE
6. **Late Fusion & Memory-Augmented Methods** — cross-attention, kNN-LM, dynamic retrieval
7. **Building a Complete RAG Pipeline** — chunking, embedding, retrieval, generation
8. **Hallucination Detection with RAGAS** — faithfulness, answer relevancy, context precision
9. **RAG Trade-offs** — benefits, limitations, computational complexity
10. **Applied RAG Pipeline: Sherlock Holmes Chatbot** — building a full RAG chatbot using the corpus from Tutorials 03 & 07
11. **RAGAS Evaluation on Your Corpus** — quantitative evaluation of RAG modes (No-RAG vs BM25 vs Dense)

At the end you will find the **Exercises** section with graded assignments.

> **Note:** This course is about Information Retrieval, Text Mining, and Conversational Search — not about programming skills. The code cells below show you *how* these methods work in practice using Python libraries. Focus on understanding the **concepts** and **results**.

> **Cross-notebook arc:** In Tutorial 03 you built a search engine over your own text. In Tutorial 07 you built a Knowledge Graph, FAISS vector store, and QA test set from that corpus. In Sections 10–11 of this tutorial, you use all of those artifacts to build and evaluate a RAG chatbot.

## Library Installation

We install all required packages in a single cell. Run this cell once at the beginning of your session.

In [None]:
# Install required packages
import subprocess, sys

packages = [
    "sentence-transformers",
    "rank_bm25",
    "faiss-cpu",
    "scikit-learn",
    "openai",
    "tiktoken",
]
for pkg in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

print("All packages installed successfully.")

## Library Imports

All imports are grouped here so the notebook is easy to set up and run.

In [None]:
# Core Python
import warnings
warnings.filterwarnings("ignore")

import os
import json
import hashlib
import textwrap
import getpass
import numpy as np
from collections import Counter

# BM25
from rank_bm25 import BM25Okapi

# Sentence Transformers (dense retrieval + cross-encoder)
from sentence_transformers import SentenceTransformer, CrossEncoder

# FAISS for vector search
import faiss

# Scikit-learn utilities
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Tiktoken for token counting
import tiktoken

# OpenAI (for RAG generation)
from openai import OpenAI

print("All libraries imported successfully.")

## API Key Setup

We use OpenAI's API for the generation component of our RAG pipeline. Enter your API key below. The prompt hides what you type, and the key is kept only in this session's environment variables; it is not written to the notebook.

In [None]:
# Set up OpenAI API key securely
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
client = OpenAI()
print("OpenAI client configured.")

## Load Models

We load the embedding model and cross-encoder once, so they can be reused throughout the tutorial.

In [None]:
# Load models
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Token counter for context window management
enc = tiktoken.encoding_for_model("gpt-4o-mini")

print(f"Embedding model: all-MiniLM-L6-v2 (dim={embedding_model.get_sentence_embedding_dimension()})")
print(f"Cross-encoder: ms-marco-MiniLM-L-6-v2")
print("Models loaded.")

---
# 1. Challenges in Conversational AI

Before diving into RAG, we must understand *why* it is needed. Large Language Models face several fundamental challenges:

## The "Stochastic Parrot" Problem

LLMs are statistical next-word predictors trained on massive text corpora. They have:
- **No real knowledge** — only patterns learned from training data
- **No memory** — each conversation starts fresh (without external memory systems)
- **No understanding** — just sophisticated pattern matching
- **No built-in fact-checking** — they generate plausible-sounding text regardless of truth

> *"On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?"* — Bender, Gebru et al. (FAccT 2021)

## Key Challenges

| Challenge | Description | Impact |
|---|---|---|
| **Hallucination** | LLMs fabricate facts that sound convincing but are false | Users trust wrong information |
| **Bias** | Training data contains societal biases (Wikipedia, Reddit, news) | Unfair or skewed responses |
| **Provenance** | No way to trace *where* an answer comes from | Cannot verify claims |
| **Stale knowledge** | Training data has a cutoff date | Outdated information |
| **No domain expertise** | General models lack specialized knowledge | Poor performance on domain tasks |

## Solutions in Brief

| Solution | Approach | Where in Pipeline |
|---|---|---|
| **Prompt Engineering** | Background prompts, CoT, context injection | Pre-generation |
| **RAG** | Retrieve external documents to ground generation | Pre-generation |
| **Knowledge Graphs** | Structured facts via SPARQL queries | Pre-generation |
| **RLHF** | Align model outputs with human preferences | Training time |
| **Hallucination Detection** | RAGAS, TruthfulQA metrics | Post-generation |
| **Agentic Architecture** | Agents with verification tools | Full pipeline (Tutorial 12) |

In this tutorial, we focus on **RAG**, **Knowledge Graphs**, and **Hallucination Detection**.

---
# 2. What is RAG? (Retrieval-Augmented Generation)

**RAG** is a hybrid approach that combines information retrieval with generative models to enhance factual grounding:

```
Without RAG:  Query ──────────────────────────────→ LLM → Response (may hallucinate)

With RAG:     Query → Retriever → Retrieved Docs ─→ LLM → Response (grounded in evidence)
                         ↑
                   Vector Store / Index
```

## RAG Pipeline Components

| Component | Role | Example Tools |
|---|---|---|
| **Data Ingestion** | Load documents from various sources | LangChain, LlamaIndex |
| **Chunking** | Split documents into manageable pieces | LangChain TextSplitters |
| **Embedding Model** | Convert text chunks to vectors | SentenceTransformers, OpenAI Embeddings |
| **Vector Store** | Store and search embeddings | FAISS, Pinecone, Weaviate, Chroma |
| **Retriever** | Find relevant passages for a query | BM25, DPR, ColBERT |
| **Re-Ranker** | Re-score candidates for precision | Cross-Encoders (ms-marco) |
| **Generator** | Produce final answer from query + context | GPT-4, FLAN-T5, LLaMA |
| **Evaluation** | Measure quality and detect hallucination | RAGAS, TruthfulQA |
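
The **Chunking** row can be made concrete with a small sketch. The helper below, `chunk_words`, is a hypothetical name introduced here (it is not a LangChain function), and the window sizes are illustrative assumptions. It splits text into overlapping word windows; production splitters usually respect sentence or token boundaries instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

# Synthetic 120-word "document" (w0 ... w119), used only for illustration
sample = " ".join(f"w{i}" for i in range(120))
chunks = chunk_words(sample, chunk_size=50, overlap=10)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.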

## Benefits of RAG

- **Up-to-date output** — retrieves from current knowledge base, not frozen training data
- **Factual grounding** — answers are backed by retrieved evidence
- **Smaller models, better results** — retrieval reduces the LLM's knowledge burden
- **Domain adaptability** — swap the retrieval corpus for any domain
- **Explainability** — sources can be cited alongside the answer

## Sample Knowledge Base

We create a knowledge base of text passages about AI and Information Retrieval. This serves as our document collection for the RAG pipeline throughout this tutorial.

In [None]:
# Knowledge base: passages about AI and IR topics
knowledge_base = [
    {
        "id": "doc_001",
        "title": "BM25 Retrieval",
        "text": "BM25 (Best Matching 25) is a probabilistic retrieval function that ranks documents based on term frequency, inverse document frequency, and document length normalization. It is the default ranking function in Elasticsearch and Apache Solr. The formula uses parameters k1 (typically 1.2-2.0) for term frequency saturation and b (typically 0.75) for document length normalization.",
        "source": "IR Textbook, Chapter 5"
    },
    {
        "id": "doc_002",
        "title": "Dense Passage Retrieval (DPR)",
        "text": "Dense Passage Retrieval uses dual BERT encoders to independently encode questions and passages into dense vectors. The encoders are trained on question-passage pairs from Natural Questions and TriviaQA. DPR achieves significant improvements over BM25 on open-domain question answering by capturing semantic similarity beyond lexical matching.",
        "source": "Karpukhin et al., 2020"
    },
    {
        "id": "doc_003",
        "title": "RAG Architecture",
        "text": "Retrieval-Augmented Generation (RAG) combines a pre-trained retriever (e.g., DPR) with a pre-trained sequence-to-sequence generator (e.g., BART). The retriever finds relevant documents from a knowledge source, and the generator conditions on both the query and retrieved documents to produce the final output. RAG can be used in two modes: RAG-Sequence (the same retrieved document conditions the entire output sequence) and RAG-Token (each output token can draw on a different retrieved document).",
        "source": "Lewis et al., 2020"
    },
    {
        "id": "doc_004",
        "title": "Knowledge Graphs in AI",
        "text": "Knowledge Graphs store structured information as triples: (subject, predicate, object). Major public KGs include Wikidata, DBpedia, and YAGO. They can be queried using SPARQL, a graph query language similar to SQL. Knowledge Graphs provide factual grounding for AI systems by offering verified, structured relationships between entities.",
        "source": "Hogan et al., 2021"
    },
    {
        "id": "doc_005",
        "title": "TransE Embedding Model",
        "text": "TransE (Translating Embeddings) is a knowledge graph embedding method where relationships are modeled as translations in embedding space: h + r approximately equals t, where h is the head entity, r is the relation, and t is the tail entity. TransE is simple and efficient but struggles with one-to-many and many-to-many relations. It produces embeddings that can be used in vector-level RAG.",
        "source": "Bordes et al., 2013"
    },
    {
        "id": "doc_006",
        "title": "RAGAS Evaluation Framework",
        "text": "RAGAS (Retrieval-Augmented Generation Assessment) evaluates RAG pipelines using four metrics: Faithfulness (fraction of answer claims supported by context), Answer Relevancy (relevance of answer to question), Context Precision (proportion of relevant items in retrieved context), and Context Recall (how much of the ground truth is covered by context). Faithfulness is the primary metric for hallucination detection.",
        "source": "Es et al., 2023"
    },
    {
        "id": "doc_007",
        "title": "Chain-of-Thought Prompting",
        "text": "Chain-of-Thought (CoT) prompting enables LLMs to solve complex reasoning tasks by including intermediate reasoning steps in the prompt. By providing worked examples with step-by-step reasoning, LLMs produce more accurate and verifiable outputs. CoT can be combined with RAG to improve both retrieval queries and answer generation.",
        "source": "Wei et al., 2022"
    },
    {
        "id": "doc_008",
        "title": "FAISS Vector Database",
        "text": "FAISS (Facebook AI Similarity Search) is an open-source library for efficient similarity search and clustering of dense vectors. It supports various index types including flat (exact), IVF (inverted file), and HNSW (hierarchical navigable small world) for approximate nearest neighbor search. Approximate indexes such as IVF and HNSW reduce search cost from the O(N) of an exhaustive flat scan to sub-linear in the number of vectors, which makes FAISS a common choice for production RAG systems.",
        "source": "Johnson et al., 2019"
    },
    {
        "id": "doc_009",
        "title": "Hallucination in LLMs",
        "text": "Hallucination occurs when an LLM generates text that is fluent and plausible but factually incorrect or unsupported by the provided context. Types include intrinsic hallucination (contradicting source material) and extrinsic hallucination (adding unverifiable information). Mitigation strategies include RAG, knowledge graph grounding, RLHF, and post-generation fact-checking with frameworks like RAGAS and TruthfulQA.",
        "source": "Ji et al., 2023"
    },
    {
        "id": "doc_010",
        "title": "Fusion-in-Decoder (FiD)",
        "text": "Fusion-in-Decoder is a RAG variant where each retrieved passage is independently encoded by the encoder, and the decoder cross-attends to all encoded passages. Unlike simple prompt concatenation, FiD can handle a large number of retrieved passages efficiently because the encoder processes each passage separately, while the decoder fuses information during generation.",
        "source": "Izacard & Grave, 2020"
    },
    {
        "id": "doc_011",
        "title": "kNN-LM (Memory-Augmented Generation)",
        "text": "The k-Nearest Neighbor Language Model (kNN-LM) augments a pre-trained language model by interpolating its output distribution with a distribution derived from nearest neighbor lookups in a datastore. At each generation step, the model queries an external memory of (context, target) pairs, and the final probability is a weighted combination: p = lambda * p_kNN + (1-lambda) * p_LM. This enables domain adaptation without retraining.",
        "source": "Khandelwal et al., 2020"
    },
    {
        "id": "doc_012",
        "title": "GraphSAGE for Node Embeddings",
        "text": "GraphSAGE (SAmple and aggreGatE) is a graph neural network that generates node embeddings by sampling and aggregating features from local neighborhoods. For each node, it samples neighbors, aggregates their features (using mean, LSTM, or pooling), and concatenates with the node's own features. GraphSAGE is inductive: it can generate embeddings for unseen nodes, making it suitable for dynamic knowledge graphs in RAG.",
        "source": "Hamilton et al., 2017"
    },
]

# Extract texts for embedding
kb_texts = [doc["text"] for doc in knowledge_base]
kb_titles = [doc["title"] for doc in knowledge_base]

print(f"Knowledge base loaded: {len(knowledge_base)} documents")
for doc in knowledge_base:
    print(f"  [{doc['id']}] {doc['title']} — {doc['text'][:70]}...")

---
# 3. Prompt-Level RAG (Early Fusion)

The simplest and most common form of RAG operates at the **text/prompt level**. The idea is straightforward:

1. **Retrieve** relevant documents for the query
2. **Concatenate** them into the prompt as context
3. **Generate** an answer conditioned on that context

```
User Query: "What is BM25?"
          ↓
   [Retriever finds top-3 passages]
          ↓
   Prompt: "Given the following context:
            [Retrieved passage 1]
            [Retrieved passage 2]
            [Retrieved passage 3]
            
            Answer the question: What is BM25?"
          ↓
   [LLM generates grounded answer]
```

## Methods

| Method | Description |
|---|---|
| **Simple Concatenation** | Append top-k retrieved passages before the query |
| **Prompt Templates** | Structured injection: "Given this context: {context}, answer: {query}" |
| **Selective Context Injection** | Only inject the most relevant sentences, not full passages |
| **Context Window Management** | Dynamic token allocation — prioritize high-similarity chunks |
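
**Selective Context Injection** is the one method in this table that the pipeline below does not implement, so here is a minimal sketch. It assumes a naive regex sentence splitter and a tiny hand-written stopword list; `select_sentences` and `STOPWORDS` are hypothetical names introduced for illustration. A real system would more likely score sentences with an embedding model or a cross-encoder.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard resource)
STOPWORDS = {"what", "is", "a", "the", "how", "does", "it", "in", "on", "of"}

def select_sentences(query, passages, max_sentences=3):
    """Keep the sentences that share the most distinct content words with the query."""
    query_terms = set(re.findall(r"\w+", query.lower())) - STOPWORDS
    scored = []
    for passage in passages:
        # Naive sentence split: break after ., ! or ? followed by whitespace
        for sentence in re.split(r"(?<=[.!?])\s+", passage.strip()):
            terms = set(re.findall(r"\w+", sentence.lower()))
            overlap = len(query_terms & terms)
            if overlap:
                scored.append((overlap, sentence))
    # Stable sort: sentences with equal overlap keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:max_sentences]]

passages = [
    "BM25 is a probabilistic retrieval function. It is widely used in search engines.",
    "Dense retrieval encodes text into vectors. BM25 relies on term frequency instead.",
]
selected = select_sentences("What is BM25?", passages, max_sentences=2)
print(selected)
```

Only the two sentences that mention BM25 survive, so the prompt carries less irrelevant text than it would with full passages.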

## Key Papers
- **Lewis et al., 2020** — *Retrieval-Augmented Generation for Knowledge-Intensive NLP* (original RAG)
- **Izacard & Grave, 2020** — *Fusion-in-Decoder* (FiD): passages are encoded independently and fused by the decoder's cross-attention

Let's build a prompt-level RAG pipeline step by step:

In [None]:
# Step 1: Build the retrieval index

# Encode all knowledge base documents
kb_embeddings = embedding_model.encode(kb_texts, convert_to_numpy=True, show_progress_bar=False)
print(f"Encoded {len(kb_embeddings)} documents → shape: {kb_embeddings.shape}")

# Build FAISS index (inner product on normalized vectors = cosine similarity)
dim = kb_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
kb_embeddings_norm = kb_embeddings.copy()
faiss.normalize_L2(kb_embeddings_norm)
index.add(kb_embeddings_norm)

# Also build BM25 index for hybrid search
tokenized_kb = [doc.lower().split() for doc in kb_texts]
bm25 = BM25Okapi(tokenized_kb)

print(f"FAISS index built: {index.ntotal} vectors, dim={dim}")
print(f"BM25 index built: {len(tokenized_kb)} documents")

In [None]:
# Step 2: Build retriever functions

def retrieve_dense(query, top_k=3):
    """Retrieve documents using dense (semantic) search."""
    q_emb = embedding_model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, top_k)
    return [(int(idx), float(score)) for idx, score in zip(indices[0], scores[0])]

def retrieve_bm25(query, top_k=3):
    """Retrieve documents using BM25 keyword search."""
    tokenized_query = query.lower().split()
    scores = bm25.get_scores(tokenized_query)
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [(int(idx), float(scores[idx])) for idx in top_indices]

def retrieve_hybrid(query, top_k=3, alpha=0.5):
    """Hybrid retrieval: combine BM25 and dense scores."""
    # BM25 scores
    bm25_scores = bm25.get_scores(query.lower().split())
    
    # Dense scores
    q_emb = embedding_model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    dense_scores_raw, dense_ids = index.search(q_emb, len(kb_texts))
    dense_scores = np.zeros(len(kb_texts))
    for i, idx in enumerate(dense_ids[0]):
        dense_scores[idx] = dense_scores_raw[0][i]
    
    # Normalize both to [0,1]
    scaler = MinMaxScaler()
    bm25_norm = scaler.fit_transform(bm25_scores.reshape(-1, 1)).flatten()
    dense_norm = scaler.fit_transform(dense_scores.reshape(-1, 1)).flatten()
    
    # Fuse
    hybrid = alpha * bm25_norm + (1 - alpha) * dense_norm
    top_indices = np.argsort(hybrid)[::-1][:top_k]
    return [(int(idx), float(hybrid[idx])) for idx in top_indices]

def rerank(query, candidates):
    """Re-rank candidates using a cross-encoder."""
    pairs = [(query, kb_texts[idx]) for idx, _ in candidates]
    ce_scores = cross_encoder.predict(pairs)
    reranked = sorted(zip([idx for idx, _ in candidates], ce_scores),
                      key=lambda x: x[1], reverse=True)
    return [(idx, float(score)) for idx, score in reranked]

# Test the retriever
query = "How does RAG work?"
print(f"Query: \"{query}\"\n")

for name, results in [("Dense", retrieve_dense(query)),
                      ("BM25", retrieve_bm25(query)),
                      ("Hybrid", retrieve_hybrid(query))]:
    print(f"  {name} Top-3:")
    for idx, score in results:
        print(f"    [{kb_titles[idx]:40s}] score={score:.4f}")
    print()
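
The hybrid retriever above fuses min-max-normalized scores, which can be sensitive to outlier scores in either list. A commonly used alternative is reciprocal rank fusion (RRF), which only looks at rank positions. The sketch below is not part of the tutorial's pipeline; `reciprocal_rank_fusion` is a hypothetical helper, the two input rankings are made-up document indices, and `k=60` is a frequently used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=3):
    """Fuse ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]

# Made-up rankings of document indices, best first
bm25_ranking = [2, 0, 5]
dense_ranking = [2, 7, 0]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of both lists are rewarded, and no score normalization is needed because raw scores never enter the formula.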

In [None]:
# Step 3: Construct the RAG prompt

def build_rag_prompt(query, retrieved_indices, max_tokens=2000):
    """
    Build a prompt-level RAG prompt with retrieved context.
    Includes context window management (token counting).
    """
    # System prompt with instructions
    system_prompt = (
        "You are a helpful AI assistant specialized in Information Retrieval and Text Mining. "
        "Answer the user's question based ONLY on the provided context. "
        "If the context does not contain enough information to answer, say 'I don't have enough information to answer this.' "
        "Always cite the source document when making a claim."
    )
    
    # Build context from retrieved documents
    context_parts = []
    total_tokens = len(enc.encode(system_prompt))
    
    for idx, score in retrieved_indices:
        doc = knowledge_base[idx]
        passage = f"[Source: {doc['source']}] {doc['title']}: {doc['text']}"
        passage_tokens = len(enc.encode(passage))
        
        if total_tokens + passage_tokens > max_tokens:
            print(f"  ⚠ Context window limit reached ({total_tokens}/{max_tokens} tokens), skipping remaining docs")
            break
        
        context_parts.append(passage)
        total_tokens += passage_tokens
    
    context = "\n\n".join(context_parts)
    
    user_message = f"""Based on the following context, answer the question.

CONTEXT:
{context}

QUESTION: {query}

ANSWER (cite sources):"""
    
    return system_prompt, user_message, total_tokens

# Demonstrate prompt construction
query = "What is RAG and how does it work?"
results = retrieve_hybrid(query, top_k=3)
results = rerank(query, results)

system_prompt, user_message, token_count = build_rag_prompt(query, results)

print(f"Query: \"{query}\"")
print(f"Total context tokens: {token_count}")
print(f"\n{'='*60}")
print(f"SYSTEM PROMPT:\n{system_prompt}")
print(f"\n{'='*60}")
print(f"USER MESSAGE:\n{user_message[:800]}...")

In [None]:
# Step 4: Generate answer with RAG

def rag_answer(query, top_k=3, alpha=0.5, use_rerank=True, model="gpt-4o-mini"):
    """Complete RAG pipeline: retrieve → rerank → prompt → generate."""
    # Retrieve
    candidates = retrieve_hybrid(query, top_k=top_k, alpha=alpha)
    
    # Re-rank
    if use_rerank:
        candidates = rerank(query, candidates)
    
    # Build prompt
    system_prompt, user_message, tokens = build_rag_prompt(query, candidates)
    
    # Generate
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
        temperature=0.2,  # Low temperature for factual answers
        max_tokens=500,
    )
    
    answer = response.choices[0].message.content
    
    # Collect sources
    sources = [knowledge_base[idx] for idx, _ in candidates]
    
    return {
        "query": query,
        "answer": answer,
        "sources": sources,
        "context_tokens": tokens,
        "model": model,
    }

# Test the full RAG pipeline
result = rag_answer("What is RAG and how does it work?")

print(f"Query: \"{result['query']}\"")
print(f"Model: {result['model']} | Context tokens: {result['context_tokens']}")
print(f"\nAnswer:\n{result['answer']}")
print(f"\nSources:")
for s in result['sources']:
    print(f"  - {s['title']} ({s['source']})")

### Comparing RAG vs. No-RAG

Let's see the difference between asking the LLM directly (no RAG) and using our RAG pipeline:

In [None]:
# Compare RAG vs. no-RAG responses

test_queries = [
    "What is the RAGAS framework and how does it detect hallucinations?",
    "How does kNN-LM augment language model generation?",
    "What are the differences between TransE and GraphSAGE?",
]

for query in test_queries:
    print(f"{'='*70}")
    print(f"Query: \"{query}\"\n")
    
    # No RAG (direct LLM)
    no_rag = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
        temperature=0.2,
        max_tokens=200,
    ).choices[0].message.content
    
    # With RAG
    rag_result = rag_answer(query)
    
    print(f"WITHOUT RAG:")
    print(f"  {no_rag[:300]}...")
    print(f"\nWITH RAG:")
    print(f"  {rag_result['answer'][:300]}...")
    print(f"  Sources: {', '.join(s['title'] for s in rag_result['sources'])}")
    print()

**Observation:** The RAG-augmented answers are:
- **Grounded** in specific, verifiable sources
- **More precise** because they draw from curated knowledge
- **Citable** — claims can be traced to the retrieved source documents, provided the model follows the citation instruction

Without RAG, the LLM answers from patterns learned during training, so its output may be outdated, incomplete, or fabricated.

---
# 4. Vector-Level RAG (Embedding Fusion)

While prompt-level RAG injects text into prompts, **vector-level RAG** operates in embedding space:

## Key Methods

### A. Query Expansion in Embedding Space
Expand the query embedding with vectors from relevant documents:
- **Averaging:** $\vec{q}_{\text{new}} = \text{mean}(\vec{q}, \vec{d}_1, \vec{d}_2, \ldots)$
- **Weighted averaging:** $\vec{q}_{\text{new}} = \alpha \cdot \vec{q} + (1 - \alpha) \cdot \text{mean}(\vec{d}_1, \ldots, \vec{d}_k)$
- **Learned transformation:** A small MLP fuses query + context embeddings

### B. Cross-Attention Vector Fusion
Before generation, the query embedding is modified by attending over retrieved context vectors:
1. Query vector attends over retrieved context vectors via attention layers
2. New query embedding is computed by weighted attention
3. This contextualized query is sent to the LLM
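
The three steps above can be sketched with single-head scaled dot-product attention in NumPy. This is a simplified illustration under stated assumptions: there are no learned projection matrices (queries, keys, and values are the raw vectors), the 4-dimensional vectors are hand-picked toy values, and `attention_fuse` is a hypothetical name.

```python
import numpy as np

def attention_fuse(query_vec, context_vecs):
    """Return (fused_query, weights) from single-head dot-product attention."""
    d = query_vec.shape[0]
    logits = context_vecs @ query_vec / np.sqrt(d)   # similarity of query to each context
    logits = logits - logits.max()                   # subtract max for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()  # softmax attention weights
    attended = weights @ context_vecs                # weighted sum of context vectors
    fused = query_vec + attended                     # residual connection keeps the query signal
    return fused / np.linalg.norm(fused), weights

# Toy vectors: context 0 points in almost the same direction as the query
query_vec = np.array([1.0, 0.0, 0.0, 0.0])
context_vecs = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
fused, weights = attention_fuse(query_vec, context_vecs)
print("attention weights:", np.round(weights, 3))
```

The context vector closest to the query receives the largest weight, so it contributes most to the contextualized query.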

### The Alignment Problem
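Vectors from different sources (text embeddings, KG embeddings, graph embeddings) may live in **different vector spaces**, often with different dimensionalities. Before fusion, they must be **aligned**, that is, mapped into a shared space. L2 normalization puts vectors on a common scale, so that inner products equal cosine similarities, but it does not align the spaces by itself:

$$\hat{v} = \frac{v}{\|v\|}$$

Methods for alignment include: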
- Linear projection layers
- Contrastive learning (CLIP-style)
- Adapter networks
- Joint training
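
As a minimal sketch of the first option, a linear projection can be fitted by least squares when paired examples are available (the same entity embedded in both spaces). Everything below is synthetic: the "KG" and "text" vectors are random data generated for illustration, and the dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_pairs, kg_dim, text_dim = 200, 16, 32

# Synthetic paired data: text vectors are an unknown linear map of KG vectors plus noise
kg_vecs = rng.normal(size=(n_pairs, kg_dim))
true_map = rng.normal(size=(kg_dim, text_dim))
text_vecs = kg_vecs @ true_map + 0.01 * rng.normal(size=(n_pairs, text_dim))

# Fit W that minimizes ||kg_vecs @ W - text_vecs||^2 (ordinary least squares)
W, *_ = np.linalg.lstsq(kg_vecs, text_vecs, rcond=None)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Project an unseen KG vector into the text space and compare with its true image
new_kg = rng.normal(size=kg_dim)
projected = new_kg @ W
target = new_kg @ true_map
print(f"cosine(projected, target) = {cosine(projected, target):.4f}")
```

Real embedding spaces are rarely related by an exact linear map, which is why contrastive training and adapter networks are also on the list.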

Let's demonstrate query expansion in embedding space:

In [None]:
# Vector-Level RAG: Query Expansion in Embedding Space

def vector_expanded_retrieval(query, top_k_expand=3, top_k_final=3, alpha=0.7):
    """
    1. Retrieve top-k docs using original query
    2. Expand query vector using retrieved doc vectors
    3. Re-retrieve using expanded query vector
    """
    # Step 1: Initial retrieval
    q_emb = embedding_model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    scores, indices = index.search(q_emb, top_k_expand)
    
    initial_results = [(int(idx), float(s)) for idx, s in zip(indices[0], scores[0])]
    
    # Step 2: Vector expansion — weighted average of query + retrieved docs
    doc_embs = kb_embeddings_norm[indices[0]]  # Embeddings of retrieved docs
    expanded_q = alpha * q_emb + (1 - alpha) * np.mean(doc_embs, axis=0, keepdims=True)
    faiss.normalize_L2(expanded_q)
    
    # Step 3: Re-retrieve with expanded query
    scores2, indices2 = index.search(expanded_q, top_k_final)
    expanded_results = [(int(idx), float(s)) for idx, s in zip(indices2[0], scores2[0])]
    
    return initial_results, expanded_results

# Compare original vs. expanded retrieval
queries = [
    "How to prevent AI from making things up?",       # Indirect phrasing
    "embedding methods for graph structures",          # Vague query
    "combining search with text generation",           # Paraphrase of RAG
]

for query in queries:
    initial, expanded = vector_expanded_retrieval(query)
    
    print(f"Query: \"{query}\"")
    print(f"  Original retrieval:  {[kb_titles[idx] for idx, _ in initial]}")
    print(f"  Expanded retrieval:  {[kb_titles[idx] for idx, _ in expanded]}")
    
    # Check if expansion changed the results
    orig_set = set(idx for idx, _ in initial)
    exp_set = set(idx for idx, _ in expanded)
    new_docs = exp_set - orig_set
    if new_docs:
        print(f"  ★ New documents found: {[kb_titles[idx] for idx in new_docs]}")
    print()

**Observation:** Vector expansion enriches the query representation with information from initially retrieved documents. This is particularly useful for:
- **Vague or paraphrased queries** — the expanded vector captures more relevant semantics
- **Domain-specific terminology** — initial results add domain vocabulary to the query vector
- **Multi-hop reasoning** — retrieved documents may bridge to other relevant documents

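This is the embedding-level equivalent of the query expansion discussed in Tutorial 10. Because the expansion vectors come from the initial results without any relevance judgments, it is a form of pseudo-relevance feedback: if the first-pass results are off-topic, the expanded query can drift further from the user's intent.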

---
# 5. Knowledge Graph Integration (KG-Augmented RAG)

Knowledge Graphs provide **structured, verified facts** that complement unstructured text retrieval:

## What is a Knowledge Graph?

A KG stores information as **triples**: (Subject, Predicate, Object)
- Example: `(Albert_Einstein, bornIn, Ulm)`, `(Albert_Einstein, wonAward, Nobel_Prize_Physics)`
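
A triple store can be pictured as a list of tuples with pattern matching on top. The sketch below uses the two example triples plus one extra triple, `(Ulm, locatedIn, Germany)`, added here for illustration; `match` is a hypothetical helper, not an API of any KG library.

```python
triples = [
    ("Albert_Einstein", "bornIn", "Ulm"),
    ("Albert_Einstein", "wonAward", "Nobel_Prize_Physics"),
    ("Ulm", "locatedIn", "Germany"),  # extra triple added for the two-hop example
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# One-hop lookup: where was Einstein born?
birthplaces = match(triples, subject="Albert_Einstein", predicate="bornIn")

# Two-hop lookup: in which country was Einstein born?
countries = [
    country
    for _, _, city in birthplaces
    for _, _, country in match(triples, subject=city, predicate="locatedIn")
]
print(birthplaces, countries)
```

Chaining patterns like this is what a graph query language automates, including joins over shared variables.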

## Major Public Knowledge Graphs

| Knowledge Graph | Description | SPARQL Endpoint |
|---|---|---|
| **Wikidata** | Community-curated structured data from Wikimedia | query.wikidata.org |
| **DBpedia** | Structured content extracted from Wikipedia | dbpedia.org/sparql |
| **YAGO** | Combines Wikidata + Wikipedia + WordNet | yago-knowledge.org |
| **ConceptNet** | Commonsense knowledge graph | N/A (API-based) |
| **UMLS** | Unified Medical Language System | N/A (license required) |

## SPARQL — Querying Knowledge Graphs

SPARQL is to Knowledge Graphs what SQL is to relational databases:

```sparql
SELECT ?person ?personLabel ?birthPlace ?birthPlaceLabel
WHERE {
  ?person wdt:P31 wd:Q5 .          # Instance of human
  ?person wdt:P166 wd:Q38104 .     # Award received: Nobel Prize in Physics
  ?person wdt:P19 ?birthPlace .     # Place of birth
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
```
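
A query like this can be sent to a SPARQL endpoint over HTTP. The sketch below uses only the standard library and assumes the public Wikidata endpoint at `https://query.wikidata.org/sparql`; the `User-Agent` string is a placeholder you should replace with your own contact details. `build_request`, `parse_bindings`, and `run_query` are hypothetical helper names, and `sample_payload` is hand-written in the SPARQL JSON results layout to show parsing offline. It is not real endpoint output.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

def build_request(sparql, user_agent="IRTM-Tutorial/0.1 (educational example)"):
    """Build a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    return urllib.request.Request(
        f"{WIKIDATA_ENDPOINT}?{params}",
        headers={"User-Agent": user_agent, "Accept": "application/sparql-results+json"},
    )

def parse_bindings(payload):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in payload["results"]["bindings"]
    ]

def run_query(sparql, timeout=30):
    """Send the query (requires network access) and return the flattened rows."""
    with urllib.request.urlopen(build_request(sparql), timeout=timeout) as resp:
        return parse_bindings(json.load(resp))

# Hand-written payload in the SPARQL JSON results layout (illustration only)
sample_payload = {
    "head": {"vars": ["personLabel", "birthPlaceLabel"]},
    "results": {"bindings": [
        {"personLabel": {"type": "literal", "value": "Albert Einstein"},
         "birthPlaceLabel": {"type": "literal", "value": "Ulm"}},
    ]},
}
rows = parse_bindings(sample_payload)
print(rows)
```

With network access, `run_query(your_sparql_string)` would return rows in the same flattened shape, which can then be verbalized and injected into a prompt.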

## KG-Augmented RAG Methods

| Method | Description |
|---|---|
| **KG Triples as Text** | Convert triples to natural language, inject into prompt |
| **KG Embeddings (TransE)** | Embed entities/relations as vectors: $h + r \approx t$ |
| **GNN Embeddings (GraphSAGE)** | Embed graph structure using message-passing neural networks |
| **SPARQL + LLM** | LLM generates SPARQL queries, results injected back into prompt |
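
The TransE row can be illustrated with a toy scoring function. The 2-D embeddings below are hand-picked (not trained) so that $h + r$ lands exactly on the correct tail; `transe_score` and `predict_tail` are hypothetical names. A trained model would learn these vectors from the graph, typically with a margin-based ranking loss.

```python
import numpy as np

# Hand-picked 2-D toy embeddings (assumption: chosen so that h + r = t holds exactly)
entity_vecs = {
    "Albert_Einstein":     np.array([1.0, 0.0]),
    "Ulm":                 np.array([1.0, 1.0]),
    "Nobel_Prize_Physics": np.array([3.0, 0.0]),
}
relation_vecs = {
    "bornIn":   np.array([0.0, 1.0]),
    "wonAward": np.array([2.0, 0.0]),
}

def transe_score(head, relation, tail):
    """Plausibility of a triple: negative L2 distance between h + r and t (higher is better)."""
    diff = entity_vecs[head] + relation_vecs[relation] - entity_vecs[tail]
    return -float(np.linalg.norm(diff))

def predict_tail(head, relation):
    """Rank all other entities as candidate tails for (head, relation, ?)."""
    candidates = [e for e in entity_vecs if e != head]
    return sorted(candidates, key=lambda t: transe_score(head, relation, t), reverse=True)

print(predict_tail("Albert_Einstein", "bornIn"))
print(predict_tail("Albert_Einstein", "wonAward"))
```

Ranking candidate tails by this score is how TransE is used for link prediction, and the same entity vectors can be fed into vector-level fusion once they are aligned with the text embedding space.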

Let's demonstrate KG-augmented RAG:

In [None]:
# Knowledge Graph Integration: KG triples as text for RAG

# Simulated Knowledge Graph triples (in practice, these come from Wikidata/DBpedia via SPARQL)
kg_triples = [
    ("BM25", "is_a", "retrieval function"),
    ("BM25", "used_in", "Elasticsearch"),
    ("BM25", "used_in", "Apache Solr"),
    ("BM25", "uses", "term frequency"),
    ("BM25", "uses", "inverse document frequency"),
    ("RAG", "is_a", "hybrid approach"),
    ("RAG", "combines", "information retrieval"),
    ("RAG", "combines", "language model generation"),
    ("RAG", "proposed_by", "Lewis et al. 2020"),
    ("RAG", "reduces", "hallucination"),
    ("FAISS", "is_a", "vector database"),
    ("FAISS", "developed_by", "Meta AI"),
    ("FAISS", "supports", "approximate nearest neighbor search"),
    ("FAISS", "complexity", "sub-linear search with approximate indexes (IVF, HNSW)"),
    ("TransE", "is_a", "knowledge graph embedding method"),
    ("TransE", "formula", "h + r ≈ t"),
    ("TransE", "proposed_by", "Bordes et al. 2013"),
    ("GraphSAGE", "is_a", "graph neural network"),
    ("GraphSAGE", "uses", "neighborhood sampling and aggregation"),
    ("GraphSAGE", "property", "inductive (generalizes to unseen nodes)"),
    ("RAGAS", "is_a", "evaluation framework"),
    ("RAGAS", "measures", "faithfulness"),
    ("RAGAS", "measures", "answer relevancy"),
    ("RAGAS", "measures", "context precision"),
    ("RAGAS", "detects", "hallucination"),
]

def kg_to_text(entity, triples):
    """Convert KG triples about an entity to natural language context."""
    entity_triples = [(s, p, o) for s, p, o in triples if s.lower() == entity.lower()]
    if not entity_triples:
        return None
    
    sentences = []
    for s, p, o in entity_triples:
        # Convert triple to natural language
        p_readable = p.replace("_", " ")
        sentences.append(f"{s} {p_readable} {o}.")
    
    return f"Knowledge Graph facts about {entity}: " + " ".join(sentences)

def rag_with_kg(query, top_k=3):
    """RAG pipeline augmented with Knowledge Graph triples."""
    # Extract potential entities from query (simple keyword matching)
    entities_found = []
    for s, p, o in kg_triples:
        if s.lower() in query.lower():
            if s not in entities_found:
                entities_found.append(s)
    
    # Get KG context
    kg_context_parts = []
    for entity in entities_found:
        kg_text = kg_to_text(entity, kg_triples)
        if kg_text:
            kg_context_parts.append(kg_text)
    
    # Get document retrieval context
    candidates = retrieve_hybrid(query, top_k=top_k)
    candidates = rerank(query, candidates)
    
    # Build combined context
    doc_context = "\n\n".join(
        f"[Source: {knowledge_base[idx]['source']}] {knowledge_base[idx]['title']}: {knowledge_base[idx]['text']}"
        for idx, _ in candidates
    )
    kg_context = "\n".join(kg_context_parts) if kg_context_parts else "No KG triples found."
    
    return doc_context, kg_context, entities_found

# Demonstrate KG-augmented RAG
query = "What is FAISS and how does it relate to RAG?"
doc_ctx, kg_ctx, entities = rag_with_kg(query)

print(f"Query: \"{query}\"\n")
print(f"Entities detected: {entities}")
print(f"\n--- Knowledge Graph Context ---")
print(kg_ctx)
print(f"\n--- Document Context (top passages) ---")
print(doc_ctx[:500] + "...")

In [None]:
# Generate answer with KG-augmented RAG

def rag_with_kg_answer(query, top_k=3, model="gpt-4o-mini"):
    """Full KG-augmented RAG pipeline."""
    doc_ctx, kg_ctx, entities = rag_with_kg(query, top_k)
    
    system_prompt = (
        "You are a helpful AI assistant specialized in Information Retrieval and Text Mining. "
        "Answer based ONLY on the provided context (both document passages and knowledge graph facts). "
        "If the context does not contain enough information, say so. "
        "Always cite sources."
    )
    
    user_message = f"""Based on the following context, answer the question.

KNOWLEDGE GRAPH FACTS:
{kg_ctx}

DOCUMENT PASSAGES:
{doc_ctx}

QUESTION: {query}

ANSWER (cite sources, distinguish between KG facts and document passages):"""
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
        temperature=0.2,
        max_tokens=500,
    )
    
    return response.choices[0].message.content

# Test KG-augmented RAG
queries = [
    "What is FAISS and how does it relate to RAG?",
    "How do TransE and GraphSAGE differ for knowledge graph embeddings?",
    "How does RAGAS detect hallucinations?",
]

for query in queries:
    print(f"{'='*65}")
    print(f"Query: \"{query}\"\n")
    answer = rag_with_kg_answer(query)
    print(f"Answer:\n{answer}")
    print()

**Observation:** The KG-augmented RAG combines:
1. **Structured facts** from the Knowledge Graph (precise, verified relationships)
2. **Unstructured passages** from the document retrieval (detailed explanations)

This provides the LLM with both precise factual anchors (from the KG) and rich contextual detail (from documents). Production systems such as **Google's AI Overviews** combine structured (Knowledge Graph) and unstructured (web document) retrieval in a similar way to improve factual accuracy.

---
# 6. RAG Fusion Methods: A Complete Overview

The lecture presents six main methods for integrating retrieval with generation. Here is a comprehensive comparison:

| Level | Method | How It Works | Vector Fusion? | Key Reference |
|---|---|---|---|---|
| **Prompt Level** (Early Fusion) | Text Concatenation | Retrieve text → append to prompt → generate | No | Lewis et al. 2020 (RAG) |
| **Vector Level** (Embedding Fusion) | Query Expansion / Fusion | Expand query vector with retrieved doc vectors | **Yes** | Xiong 2020 (ANCE), Guu 2020 (REALM) |
| **KG-Based** | KG Triples as Text or Embeddings | Convert KG triples to text or embed with TransE/GNN | **Yes** (if embedding) | Zhang 2021 (DKPLM) |
| **Late Fusion** | Cross-Attention | Model attends to query + retrieved vectors during decoding | Learned via attention | Lewis 2020 (RAG-Token), Izacard 2020 (FiD) |
| **Memory-Augmented** | kNN-LM, RETRO | Query external memory at each generation step; interpolate distributions | At probability level | Khandelwal 2020, Borgeaud 2022 |
| **Dynamic Retrieval** | Retrieval during decoding | Re-retrieve as new tokens are generated | Via attention | Shi 2023 (DCR/REPLUG) |

## Late Fusion: Cross-Attention during Decoding

In late fusion, the model does **not** modify the query embedding before generation. Instead:
1. At each decoding step, the token being generated has a decoder hidden state
2. This hidden state **cross-attends** to all retrieved document representations
3. The generation is influenced by attention over the retrieved passages

This is how **Fusion-in-Decoder (FiD)** works: each passage is encoded separately, and the decoder fuses information dynamically.
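
The three steps above can be sketched as a single cross-attention step. This is a minimal, single-head illustration in plain Python with no learned projection matrices; a real FiD decoder attends over token-level encoder outputs with multi-head attention, so the function name `cross_attend` and the toy vectors are assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(decoder_state, passage_vecs):
    """Scaled dot-product attention of one decoder state over passage vectors."""
    d = len(decoder_state)
    scores = [
        sum(q * k for q, k in zip(decoder_state, vec)) / math.sqrt(d)
        for vec in passage_vecs
    ]
    weights = softmax(scores)
    # Weighted sum of passage vectors (keys and values are shared for brevity).
    fused = [sum(w * vec[i] for w, vec in zip(weights, passage_vecs)) for i in range(d)]
    return weights, fused

decoder_state = [1.0, 0.0]                               # current decoder hidden state
passage_vecs = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]    # one vector per retrieved passage
weights, fused = cross_attend(decoder_state, passage_vecs)
print([round(w, 3) for w in weights])
```

The passage whose vector points in the same direction as the decoder state receives the largest weight, so it contributes most to the fused representation used for the next token.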

## Memory-Augmented Methods (kNN-LM)

At each generation step:
$$p_{\text{final}} = \lambda \cdot p_{\text{kNN}} + (1 - \lambda) \cdot p_{\text{LM}}$$

- $p_{\text{LM}}$ = the language model's own probability distribution
- $p_{\text{kNN}}$ = distribution from nearest neighbor lookup in external datastore
- $\lambda$ = interpolation weight
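
As a worked instance of the interpolation formula, the sketch below mixes two toy next-token distributions. The probabilities and the value $\lambda = 0.25$ are illustrative assumptions; in kNN-LM, $p_{\text{kNN}}$ would come from a nearest-neighbor search over stored context representations.

```python
def knn_lm_interpolate(p_lm, p_knn, lam=0.25):
    """Mix two next-token distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        tok: lam * p_knn.get(tok, 0.0) + (1 - lam) * p_lm.get(tok, 0.0)
        for tok in vocab
    }

# Toy distributions over three candidate next tokens.
p_lm = {"Paris": 0.5, "London": 0.4, "Rome": 0.1}
p_knn = {"Paris": 0.9, "Rome": 0.1}

p_final = knn_lm_interpolate(p_lm, p_knn, lam=0.25)
print(sorted((tok, round(p, 3)) for tok, p in p_final.items()))
# [('London', 0.3), ('Paris', 0.6), ('Rome', 0.1)]
```

Because both inputs sum to 1 and the weights sum to 1, the mixture is again a valid probability distribution.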

**Analogy:**
- Memory-Augmented = like a **library** the model can visit anytime
- Dynamic Retrieval = like a **personal librarian** handing over documents while the model writes each word

---
# 7. Building a Complete RAG Pipeline

Let's build a more sophisticated RAG pipeline that includes:
1. **Document chunking** with overlap
2. **Hybrid retrieval** (BM25 + dense)
3. **Cross-encoder re-ranking**
4. **Context window management** (token counting)
5. **Source citation** in generated answers

In [None]:
# Document Chunking with Overlap

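def chunk_text(text, chunk_size=200, overlap=50):
    """
    Split text into overlapping chunks.
    chunk_size and overlap are in words (not tokens).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        # A step of chunk_size - overlap <= 0 would never advance (infinite loop)
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    chunks = []
    start = 0
    step = chunk_size - overlap  # How far the window advances each iteration
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break  # Window reached the end; another chunk would only repeat the overlap
        start += step
    return chunks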

# Demonstrate chunking on a longer document
long_document = (
    "Retrieval-Augmented Generation (RAG) is a paradigm that enhances large language models "
    "by integrating external knowledge retrieval into the generation process. The core idea is "
    "that instead of relying solely on the parametric knowledge stored in the model's weights, "
    "the system retrieves relevant documents from an external corpus and uses them as additional "
    "context for generation. This approach was first formalized by Lewis et al. in 2020, who "
    "proposed using a dense passage retriever (DPR) combined with a BART sequence-to-sequence "
    "model. The retriever finds the most relevant passages from a large document collection, "
    "and the generator produces an answer conditioned on both the query and the retrieved passages. "
    "RAG has two variants: RAG-Sequence, where the same retrieved document conditions the entire "
    "generated sequence, and RAG-Token, where each output token can draw on a different retrieved document. "
    "Since its introduction, RAG has become the standard approach for building factually grounded "
    "AI systems. Modern implementations use FAISS for efficient vector search, sentence transformers "
    "for embedding generation, and cross-encoders for re-ranking. The RAGAS framework provides "
    "metrics for evaluating RAG pipeline quality, including faithfulness (detecting hallucinations), "
    "answer relevancy, context precision, and context recall. Major companies like Google (AI Overview), "
    "Microsoft (Bing Chat), and Perplexity AI all use RAG-based architectures for their conversational "
    "search products. Recent advances include memory-augmented methods like kNN-LM, which interpolate "
    "the model's output distribution with nearest-neighbor lookups in an external datastore, and "
    "dynamic retrieval methods that re-retrieve as new tokens are generated."
)

chunks = chunk_text(long_document, chunk_size=50, overlap=10)

print(f"Document: {len(long_document.split())} words")
print(f"Chunks: {len(chunks)} (size=50 words, overlap=10 words)\n")
for i, chunk in enumerate(chunks):
    tokens = len(enc.encode(chunk))
    print(f"Chunk {i}: ({len(chunk.split())} words, {tokens} tokens)")
    print(f"  {chunk[:120]}...")
    print()

In [None]:
# Complete RAG Pipeline with all components

class RAGPipeline:
    """
    A complete RAG pipeline demonstrating:
    - Document chunking and indexing
    - Hybrid retrieval (BM25 + dense)
    - Cross-encoder re-ranking
    - Context window management
    - Source citation
    """
    
    def __init__(self, documents, embedding_model, cross_encoder, client,
                 chunk_size=100, chunk_overlap=20):
        self.client = client
        self.embedding_model = embedding_model
        self.cross_encoder = cross_encoder
        self.enc = tiktoken.encoding_for_model("gpt-4o-mini")
        
        # Chunk documents
        self.chunks = []
        self.chunk_metadata = []
        for doc in documents:
            doc_chunks = chunk_text(doc["text"], chunk_size, chunk_overlap)
            for i, chunk in enumerate(doc_chunks):
                self.chunks.append(chunk)
                self.chunk_metadata.append({
                    "doc_id": doc["id"],
                    "title": doc["title"],
                    "source": doc["source"],
                    "chunk_idx": i,
                })
        
        # Build BM25 index
        self.tokenized_chunks = [c.lower().split() for c in self.chunks]
        self.bm25 = BM25Okapi(self.tokenized_chunks)
        
        # Build FAISS index
        self.chunk_embeddings = self.embedding_model.encode(
            self.chunks, convert_to_numpy=True, show_progress_bar=False
        )
        dim = self.chunk_embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dim)
        self.chunk_embeddings_norm = self.chunk_embeddings.copy()
        faiss.normalize_L2(self.chunk_embeddings_norm)
        self.index.add(self.chunk_embeddings_norm)
        
        print(f"RAG Pipeline initialized:")
        print(f"  Documents: {len(documents)}")
        print(f"  Chunks: {len(self.chunks)} (size={chunk_size}, overlap={chunk_overlap})")
        print(f"  FAISS index: {self.index.ntotal} vectors, dim={dim}")
    
    def retrieve(self, query, top_k=5, alpha=0.5):
        """Hybrid retrieval with BM25 + dense search."""
        # BM25
        bm25_scores = self.bm25.get_scores(query.lower().split())
        
        # Dense
        q_emb = self.embedding_model.encode([query], convert_to_numpy=True)
        faiss.normalize_L2(q_emb)
        dense_scores_raw, dense_ids = self.index.search(q_emb, len(self.chunks))
        dense_scores = np.zeros(len(self.chunks))
        for i, idx in enumerate(dense_ids[0]):
            dense_scores[idx] = dense_scores_raw[0][i]
        
        # Normalize and fuse
        scaler = MinMaxScaler()
        bm25_norm = scaler.fit_transform(bm25_scores.reshape(-1, 1)).flatten()
        dense_norm = scaler.fit_transform(dense_scores.reshape(-1, 1)).flatten()
        hybrid = alpha * bm25_norm + (1 - alpha) * dense_norm
        
        top_indices = np.argsort(hybrid)[::-1][:top_k]
        return [(int(idx), float(hybrid[idx])) for idx in top_indices]
    
    def rerank(self, query, candidates):
        """Cross-encoder re-ranking."""
        pairs = [(query, self.chunks[idx]) for idx, _ in candidates]
        scores = self.cross_encoder.predict(pairs)
        reranked = sorted(zip([idx for idx, _ in candidates], scores),
                          key=lambda x: x[1], reverse=True)
        return [(idx, float(s)) for idx, s in reranked]
    
    def answer(self, query, top_k=5, max_context_tokens=1500, model="gpt-4o-mini"):
        """Full RAG: retrieve → rerank → prompt → generate."""
        # Retrieve and rerank
        candidates = self.retrieve(query, top_k=top_k)
        candidates = self.rerank(query, candidates)
        
        # Build context with token management
        context_parts = []
        sources_used = []
        total_tokens = 0
        
        for idx, score in candidates:
            chunk_tokens = len(self.enc.encode(self.chunks[idx]))
            if total_tokens + chunk_tokens > max_context_tokens:
                break
            context_parts.append(
                f"[{self.chunk_metadata[idx]['title']} — {self.chunk_metadata[idx]['source']}]: "
                f"{self.chunks[idx]}"
            )
            sources_used.append(self.chunk_metadata[idx])
            total_tokens += chunk_tokens
        
        context = "\n\n".join(context_parts)
        
        system_prompt = (
            "You are a helpful AI assistant. Answer based ONLY on the provided context. "
            "Cite sources in [brackets]. If unsure, say 'I don't have enough information.'"
        )
        
        user_msg = f"CONTEXT:\n{context}\n\nQUESTION: {query}\n\nANSWER:"
        
        response = self.client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_msg}
            ],
            temperature=0.2,
            max_tokens=400,
        )
        
        return {
            "query": query,
            "answer": response.choices[0].message.content,
            "sources": sources_used,
            "context_tokens": total_tokens,
            "chunks_used": len(context_parts),
        }

# Initialize the pipeline
pipeline = RAGPipeline(
    documents=knowledge_base,
    embedding_model=embedding_model,
    cross_encoder=cross_encoder,
    client=client,
    chunk_size=80,
    chunk_overlap=15,
)

In [None]:
# Test the complete RAG pipeline

test_queries = [
    "What is the difference between BM25 and dense retrieval?",
    "How can knowledge graphs help prevent hallucination?",
    "Explain the kNN-LM approach to memory-augmented generation.",
]

for query in test_queries:
    result = pipeline.answer(query)
    
    print(f"{'='*65}")
    print(f"Query: \"{result['query']}\"")
    print(f"Chunks used: {result['chunks_used']} | Context tokens: {result['context_tokens']}")
    print(f"\nAnswer:\n{result['answer']}")
    print(f"\nSources:")
    for s in result['sources']:
        print(f"  - {s['title']} ({s['source']}, chunk {s['chunk_idx']})")
    print()

---
# 8. Hallucination Detection with RAGAS Metrics

**RAGAS** (Retrieval-Augmented Generation Assessment) is a framework for evaluating RAG pipelines. Its key metric for hallucination detection is **Faithfulness**:

$$\text{Faithfulness} = \frac{\text{Number of claims supported by context}}{\text{Total number of claims in the answer}}$$

## RAGAS Metrics Overview

| Metric | What It Measures | Range |
|---|---|---|
| **Faithfulness** | Are all claims in the answer supported by the retrieved context? | 0–1 (1 = perfect) |
| **Answer Relevancy** | Is the answer relevant to the question? | 0–1 |
| **Context Precision** | Are retrieved passages actually relevant? | 0–1 |
| **Context Recall** | Does the retrieved context cover all aspects of the ground truth? | 0–1 |
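
The RAGAS documentation describes Context Precision as rank-aware: precision@k is averaged over the ranks that hold a relevant passage, so relevant passages near the top score higher. The sketch below assumes binary relevance labels are already available in rank order (in RAGAS an LLM judges them); `context_precision_at_k` is an illustrative name, not the library's API.

```python
def context_precision_at_k(relevance):
    """Rank-aware context precision from binary relevance labels in rank order."""
    hits = 0
    weighted = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            weighted += hits / k  # precision@k, counted only at relevant ranks
    return weighted / hits if hits else 0.0

# Same two relevant passages, different positions in the ranking.
print(round(context_precision_at_k([1, 0, 1]), 3))  # 0.833
print(round(context_precision_at_k([0, 1, 1]), 3))  # 0.583
```

Both rankings contain two relevant passages out of three, yet the first scores higher because a relevant passage sits at rank 1.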

## How Faithfulness Works

1. **Claim Extraction:** Break the answer into individual factual claims
2. **Context Verification:** Check if each claim is supported by the retrieved context
3. **Score Calculation:** Fraction of supported claims

### Example

| Claim | Supported by Context? |
|---|---|
| "BM25 uses term frequency" | ✅ Yes |
| "BM25 was developed by Google" | ❌ No (hallucination!) |
| "BM25 is used in Elasticsearch" | ✅ Yes |

Faithfulness = 2/3 = 0.67 → The answer contains a hallucination.

Let's implement these metrics:

In [None]:
# RAGAS-Style Evaluation Metrics

def extract_claims(answer):
    """
    Simple claim extraction: split answer into sentences.
    In production RAGAS, an LLM performs this step.
    """
    import re
    # Split only where sentence-ending punctuation is followed by whitespace,
    # so decimals such as "0.75" are not cut in half.
    sentences = re.split(r'(?<=[.!?])\s+', answer)
    claims = [s.strip() for s in sentences if len(s.strip()) > 10]
    return claims

def faithfulness_score(answer, context, model, threshold=0.5):
    """
    Measure faithfulness: fraction of answer claims supported by context.
    Uses embedding similarity as a proxy for NLI entailment.
    """
    claims = extract_claims(answer)
    if not claims:
        return 1.0, []
    
    context_emb = model.encode([context])
    claim_embs = model.encode(claims)
    
    similarities = cosine_similarity(claim_embs, context_emb).flatten()
    
    claim_results = []
    supported = 0
    for claim, sim in zip(claims, similarities):
        is_supported = bool(sim > threshold)  # Plain bool (np.bool_ is not JSON-serializable)
        if is_supported:
            supported += 1
        claim_results.append({
            "claim": claim,
            "similarity": float(sim),
            "supported": is_supported,
        })
    
    score = supported / len(claims)
    return score, claim_results

def answer_relevancy_score(question, answer, model):
    """Semantic similarity between question and answer."""
    embs = model.encode([question, answer])
    return float(cosine_similarity([embs[0]], [embs[1]])[0][0])

def context_precision_score(question, context_passages, model, threshold=0.35):
    """Fraction of retrieved passages that are relevant to the question."""
    if not context_passages:
        # Nothing was retrieved: avoid encoding an empty list and dividing by zero
        return 0.0, np.array([])
    q_emb = model.encode([question])
    c_embs = model.encode(context_passages)
    sims = cosine_similarity(q_emb, c_embs)[0]
    relevant = sum(1 for s in sims if s > threshold)
    return relevant / len(context_passages), sims

# === Demonstration ===
context_text = knowledge_base[0]["text"]  # BM25 passage

# Good answer (grounded in context)
good_answer = (
    "BM25 is a probabilistic retrieval function that ranks documents based on "
    "term frequency and inverse document frequency. It uses parameters k1 and b "
    "for term frequency saturation and document length normalization respectively. "
    "BM25 is the default ranking function in Elasticsearch and Apache Solr."
)

# Hallucinated answer
bad_answer = (
    "BM25 is a deep learning model developed by Google in 2019. "
    "It uses transformer attention mechanisms to understand query semantics. "
    "BM25 requires GPU training on large datasets and outputs dense vectors. "
    "It is used in Elasticsearch for keyword search."
)

print("=== Faithfulness Evaluation ===\n")

for label, answer in [("GOOD (grounded)", good_answer), ("BAD (hallucinated)", bad_answer)]:
    score, results = faithfulness_score(answer, context_text, embedding_model)
    
    print(f"--- {label} ---")
    print(f"Faithfulness score: {score:.2f}\n")
    for r in results:
        status = "✅ Supported" if r["supported"] else "❌ NOT supported"
        print(f"  [{r['similarity']:.3f}] {status}")
        print(f"    \"{r['claim']}\"")
    print()

# Answer relevancy
question = "What is BM25 and how does it rank documents?"
rel_good = answer_relevancy_score(question, good_answer, embedding_model)
rel_bad = answer_relevancy_score(question, bad_answer, embedding_model)
print(f"=== Answer Relevancy ===")
print(f"  Good answer: {rel_good:.4f}")
print(f"  Bad answer:  {rel_bad:.4f}")

**Observation:** The faithfulness metric is designed to separate grounded answers from hallucinated ones:
- The **good answer** should score high, because its claims restate the BM25 context passage
- The **bad answer** makes claims about "deep learning", "transformer attention", and "GPU training" that are NOT in the context, so it should score lower. Every sentence still mentions BM25, so embedding similarity can stay close to the threshold; inspect the per-claim scores and tune `threshold` if the two answers are not separated.

In production RAGAS implementations, an LLM performs the claim extraction and verification (using NLI — Natural Language Inference), which is more accurate than our simplified embedding similarity approach.

In [None]:
# End-to-End: RAG Pipeline + Hallucination Detection

def evaluate_rag_answer(pipeline, query, ground_truth_doc_ids=None):
    """Run RAG pipeline and evaluate the answer for hallucination."""
    result = pipeline.answer(query)
    
    # Build context from what was actually retrieved
    context_chunks = []
    for s in result["sources"]:
        chunk_idx_in_pipeline = None
        for i, meta in enumerate(pipeline.chunk_metadata):
            if meta["doc_id"] == s["doc_id"] and meta["chunk_idx"] == s["chunk_idx"]:
                chunk_idx_in_pipeline = i
                break
        if chunk_idx_in_pipeline is not None:
            context_chunks.append(pipeline.chunks[chunk_idx_in_pipeline])
    
    full_context = " ".join(context_chunks)
    
    # Faithfulness
    faith, claim_details = faithfulness_score(
        result["answer"], full_context, embedding_model
    )
    
    # Answer relevancy
    relevancy = answer_relevancy_score(query, result["answer"], embedding_model)
    
    # Context precision
    ctx_prec, _ = context_precision_score(query, context_chunks, embedding_model)
    
    return {
        "query": query,
        "answer": result["answer"],
        "faithfulness": faith,
        "answer_relevancy": relevancy,
        "context_precision": ctx_prec,
        "claims": claim_details,
        "sources": result["sources"],
    }

# Evaluate multiple queries
eval_queries = [
    "What is BM25 and how does it work?",
    "How does the RAGAS framework evaluate RAG pipelines?",
    "What is Fusion-in-Decoder?",
]

print("=== End-to-End RAG Evaluation ===\n")
for query in eval_queries:
    eval_result = evaluate_rag_answer(pipeline, query)
    
    print(f"Query: \"{eval_result['query']}\"")
    print(f"  Faithfulness:      {eval_result['faithfulness']:.2f}")
    print(f"  Answer Relevancy:  {eval_result['answer_relevancy']:.4f}")
    print(f"  Context Precision: {eval_result['context_precision']:.2f}")
    
    # Flag potential hallucinations
    unsupported = [c for c in eval_result['claims'] if not c['supported']]
    if unsupported:
        print(f"  ⚠ Potential hallucinations ({len(unsupported)} unsupported claims):")
        for c in unsupported:
            print(f"    ❌ \"{c['claim'][:80]}...\"" if len(c['claim']) > 80 else f"    ❌ \"{c['claim']}\"")
    else:
        print(f"  ✅ All claims appear supported by context")
    
    print(f"  Sources: {', '.join(s['title'] for s in eval_result['sources'])}")
    print()

---
# 9. RAG Trade-offs: Benefits, Limitations & Computational Complexity

## Benefits
- **Factual grounding** — answers backed by retrieved evidence
- **Explainability** — sources can be cited
- **Domain adaptability** — swap the corpus for any domain
- **No retraining** — update knowledge by updating the document store
- **Smaller models, better results** — retrieval offloads knowledge from model weights

## Limitations

| Limitation | Description |
|---|---|
| **Retrieval bottleneck** | If the retriever misses key info, the LLM can't generate a complete answer |
| **Restricted answer space** | RAG can only answer about what's in the indexed documents |
| **Lexical bias** | Short queries may not retrieve all relevant passages |
| **Added latency** | Multiple expensive steps (embedding, retrieval, re-ranking, generation) |
| **Context window limits** | LLMs have token limits — can't fit all retrieved documents |

## Computational Complexity

| RAG Step | Method | Complexity |
|---|---|---|
| **Embedding** | Dense (BERT-based) | $O(n \cdot d)$ per query |
| **Retrieval** | Brute-force search | $O(N)$ — linear in corpus size |
| **Retrieval** | ANN (FAISS IVF, HNSW) | Sublinear: roughly $O(\sqrt{N})$ for IVF with about $\sqrt{N}$ lists, roughly $O(\log N)$ for HNSW |
| **Re-ranking** | Cross-encoder | $O(k \cdot n)$ — k candidates × sequence length |
| **Generation** | LLM inference | $O(T \cdot k)$ — T output tokens × k context passages |
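
The gap between brute-force and ANN retrieval can be estimated by counting distance computations. The sketch below uses an idealized inverted-file (IVF) model with equally sized lists; the function names and the parameter values (`nlist`, `nprobe`) are illustrative assumptions, and real FAISS costs also depend on quantization and list imbalance.

```python
import math

def brute_force_cost(n_vectors):
    """Exact search compares the query with every stored vector."""
    return n_vectors

def ivf_cost(n_vectors, nlist, nprobe):
    """Centroid comparisons plus vectors scanned in the probed lists."""
    return nlist + nprobe * n_vectors / nlist

N = 1_000_000
nlist = int(math.sqrt(N))  # common rule of thumb: about sqrt(N) lists

print(brute_force_cost(N))            # 1000000
print(ivf_cost(N, nlist, nprobe=8))   # 9000.0
```

With `nlist` near $\sqrt{N}$ and a fixed `nprobe`, both terms grow like $\sqrt{N}$, which is where the sublinear behaviour comes from. Raising `nprobe` trades speed for recall.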

## Optimization Strategies

| Strategy | How | Benefit |
|---|---|---|
| **Hybrid Retrieval** | BM25 + Dense + KG | Better recall across query types |
| **Efficient ANN** | FAISS IVF, HNSW | $O(N) \rightarrow$ sublinear retrieval, at the cost of approximate results |
| **Dynamic Truncation** | Token-aware context selection | Fit within context window |
| **Adaptive Retrieval** | Fact queries → KG, Open queries → Dense | Right tool for each query type |
| **Caching** | Cache embeddings and frequent queries | Reduce repeated computation |
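
The caching strategy can be sketched with the standard library's `functools.lru_cache`. Here `embed_text` is a hypothetical stand-in for an expensive embedding call (it returns a toy tuple, not a real embedding), and the call counter exists only to show how many computations the cache avoids.

```python
from functools import lru_cache

call_count = 0

def embed_text(text):
    """Stand-in for an expensive embedding call (hypothetical toy vector)."""
    global call_count
    call_count += 1
    return (float(len(text)), float(sum(map(ord, text)) % 97))

@lru_cache(maxsize=1024)
def cached_embed(text):
    # Strings are hashable, so lru_cache can key on the query text directly.
    return embed_text(text)

for q in ["what is bm25", "what is rag", "what is bm25"]:
    cached_embed(q)

print(call_count)                      # 2
print(cached_embed.cache_info().hits)  # 1
```

The repeated query is served from the cache, so only two of the three requests reach the embedding function. Returning an immutable tuple keeps callers from accidentally modifying the cached value.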

---
# 10. Applied RAG Pipeline: Sherlock Holmes Chatbot

In Tutorials 03 and 07, you built a search engine and prepared a vector store, Knowledge Graph, atomic facts, and QA test sets from *The Adventures of Sherlock Holmes*. Now we bring everything together to build **a complete RAG chatbot** that can answer questions about the stories.

**Data pipeline across tutorials:**
```
Tutorial 03 → custom_corpus/chunks/           (text chunks)
Tutorial 07 → custom_corpus/vector_store/     (FAISS index + embeddings)
Tutorial 07 → custom_corpus/kg/               (Knowledge Graph)
Tutorial 07 → custom_corpus/evaluation/       (atomic facts + QA test set)
Tutorial 11 → RAG chatbot + RAGAS evaluation  (this section)
```

## 10.1 Load All Artifacts

In [None]:
# Load all artifacts produced by Tutorials 03 and 07
import glob

CORPUS_BASE = os.path.join(os.path.expanduser("~"), "custom_corpus")
VECTOR_DIR = os.path.join(CORPUS_BASE, "vector_store")
KG_DIR = os.path.join(CORPUS_BASE, "kg")
EVAL_DIR = os.path.join(CORPUS_BASE, "evaluation")

# --- 1. Load chunk texts ---
chunks_path = os.path.join(VECTOR_DIR, "chunks.json")
if os.path.isfile(chunks_path):
    with open(chunks_path, "r", encoding="utf-8") as f:
        sherlock_chunks = json.load(f)
else:
    # Fallback: load from individual .txt files
    chunk_files = sorted(glob.glob(os.path.join(CORPUS_BASE, "chunks", "*.txt")))
    sherlock_chunks = []
    for fpath in chunk_files:
        with open(fpath, "r", encoding="utf-8") as f:
            text = f.read().strip()
            if len(text) > 20:
                sherlock_chunks.append(text)

print(f"Loaded {len(sherlock_chunks)} Sherlock Holmes chunks")

# --- 2. Load FAISS index ---
faiss_path = os.path.join(VECTOR_DIR, "faiss_index.bin")
if os.path.isfile(faiss_path):
    sherlock_index = faiss.read_index(faiss_path)
    print(f"FAISS index loaded: {sherlock_index.ntotal} vectors")
else:
    print("WARNING: FAISS index not found. Re-encoding chunks...")
    embs = embedding_model.encode(sherlock_chunks, convert_to_numpy=True, show_progress_bar=True)
    sherlock_index = faiss.IndexFlatIP(embs.shape[1])
    faiss.normalize_L2(embs)
    sherlock_index.add(embs)
    print(f"FAISS index rebuilt: {sherlock_index.ntotal} vectors")

# --- 3. Load embeddings ---
emb_path = os.path.join(VECTOR_DIR, "chunk_embeddings.npy")
if os.path.isfile(emb_path):
    sherlock_embeddings = np.load(emb_path)
    print(f"Embeddings loaded: {sherlock_embeddings.shape}")
else:
    sherlock_embeddings = embedding_model.encode(sherlock_chunks, convert_to_numpy=True)
    print(f"Embeddings recomputed: {sherlock_embeddings.shape}")

# --- 4. Load Knowledge Graph ---
kg_path = os.path.join(KG_DIR, "sherlock_kg.json")
if os.path.isfile(kg_path):
    import networkx as nx
    with open(kg_path, "r", encoding="utf-8") as f:
        kg_data = json.load(f)
    sherlock_kg = nx.Graph()
    for node in kg_data["nodes"]:
        sherlock_kg.add_node(node["name"], entity_type=node["entity_type"], frequency=node["frequency"])
    for edge in kg_data["edges"]:
        sherlock_kg.add_edge(edge["source"], edge["target"], weight=edge["weight"])
    print(f"Knowledge Graph loaded: {sherlock_kg.number_of_nodes()} nodes, {sherlock_kg.number_of_edges()} edges")
else:
    sherlock_kg = None
    print("WARNING: Knowledge Graph not found. KG-augmented RAG will be skipped.")

# --- 5. Load QA test set ---
qa_path = os.path.join(EVAL_DIR, "qa_test_set.json")
if os.path.isfile(qa_path):
    with open(qa_path, "r", encoding="utf-8") as f:
        sherlock_qa = json.load(f)
    print(f"QA test set loaded: {len(sherlock_qa)} pairs")
else:
    sherlock_qa = []
    print("WARNING: QA test set not found. Run Tutorial 07 Sections 7-9 first.")

# --- 6. Load atomic facts ---
facts_path = os.path.join(EVAL_DIR, "atomic_facts.json")
if os.path.isfile(facts_path):
    with open(facts_path, "r", encoding="utf-8") as f:
        sherlock_facts = json.load(f)
    print(f"Atomic facts loaded: {len(sherlock_facts)} facts")
else:
    sherlock_facts = []
    print("WARNING: Atomic facts not found.")

## 10.2 Build the Sherlock Holmes RAG Chatbot

We build a RAG chatbot using the same components from Section 7, but now on a real corpus. The chatbot supports four modes:

| Mode | Description |
|---|---|
| **No-RAG** | LLM answers from its own training data (no grounding) |
| **BM25-RAG** | Keyword-based retrieval + LLM generation |
| **Dense-RAG** | Semantic FAISS retrieval + cross-encoder reranking + LLM generation |
| **Dense+KG-RAG** | Dense-RAG plus Knowledge Graph neighbours of entities mentioned in the query |

In [None]:
# Build BM25 index over Sherlock Holmes chunks
from rank_bm25 import BM25Okapi

sherlock_bm25_tokens = [c.lower().split() for c in sherlock_chunks]
sherlock_bm25 = BM25Okapi(sherlock_bm25_tokens)
print(f"BM25 index built over {len(sherlock_chunks)} Sherlock Holmes chunks")

# --- Retrieval functions for the Sherlock corpus ---

def sherlock_retrieve_dense(query, top_k=5):
    """Dense retrieval using FAISS."""
    q_emb = embedding_model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    scores, indices = sherlock_index.search(q_emb, top_k)
    # FAISS pads results with index -1 when fewer than top_k vectors are available
    return [(int(idx), float(score)) for idx, score in zip(indices[0], scores[0]) if idx >= 0]

def sherlock_retrieve_bm25(query, top_k=5):
    """BM25 keyword retrieval."""
    scores = sherlock_bm25.get_scores(query.lower().split())
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [(int(idx), float(scores[idx])) for idx in top_indices]

def sherlock_rerank(query, candidates):
    """Re-rank candidates using cross-encoder."""
    pairs = [(query, sherlock_chunks[idx]) for idx, _ in candidates]
    ce_scores = cross_encoder.predict(pairs)
    reranked = sorted(zip([idx for idx, _ in candidates], ce_scores),
                      key=lambda x: x[1], reverse=True)
    return [(idx, float(s)) for idx, s in reranked]

def sherlock_kg_context(query, kg, top_k_entities=3):
    """Get KG-based context by finding entities mentioned in the query."""
    if kg is None:
        return ""
    context_parts = []
    for node in kg.nodes():
        if node.lower() in query.lower():
            # Get all neighbours of this entity
            neighbors = list(kg.neighbors(node))
            node_type = kg.nodes[node].get("entity_type", "?")
            if neighbors:
                neighbor_str = ", ".join(neighbors[:10])
                context_parts.append(
                    f"KG: {node} ({node_type}) is connected to: {neighbor_str}"
                )
    return "\n".join(context_parts[:top_k_entities])

# --- The RAG Chatbot ---

def sherlock_rag(query, mode="dense", top_k=5, model="gpt-4o-mini"):
    """
    Sherlock Holmes RAG chatbot.
    
    Modes: 'none' (no RAG), 'bm25', 'dense', 'dense+kg'
    """
    context = ""
    sources = []
    
    if mode == "none":
        # No retrieval — LLM answers from its own knowledge
        context = "(No external context provided)"
    
    elif mode == "bm25":
        candidates = sherlock_retrieve_bm25(query, top_k)
        sources = candidates
        context = "\n\n".join(
            f"[Passage {i+1}]: {sherlock_chunks[idx][:500]}"
            for i, (idx, _) in enumerate(candidates)
        )
    
    elif mode in ("dense", "dense+kg"):
        candidates = sherlock_retrieve_dense(query, top_k)
        candidates = sherlock_rerank(query, candidates)
        sources = candidates
        context = "\n\n".join(
            f"[Passage {i+1}]: {sherlock_chunks[idx][:500]}"
            for i, (idx, _) in enumerate(candidates)
        )
        
        if mode == "dense+kg":
            kg_ctx = sherlock_kg_context(query, sherlock_kg)
            if kg_ctx:
                context = f"KNOWLEDGE GRAPH FACTS:\n{kg_ctx}\n\nDOCUMENT PASSAGES:\n{context}"
    
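    if mode == "none":
        # No passages to cite, so let the model answer from its own knowledge
        system_prompt = (
            "You are a literary assistant specializing in Sherlock Holmes stories by Arthur Conan Doyle. "
            "Answer the user's question from your own knowledge."
        )
    else:
        system_prompt = (
            "You are a literary assistant specializing in Sherlock Holmes stories by Arthur Conan Doyle. "
            "Answer the user's question based ONLY on the provided context passages. "
            "If the context doesn't contain the answer, say 'The provided passages don't contain this information.' "
            "Always reference which passage(s) support your answer."
        )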
    
    user_msg = f"CONTEXT:\n{context}\n\nQUESTION: {query}\n\nANSWER:"
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg}
        ],
        temperature=0.2,
        max_tokens=400,
    )
    
    return {
        "query": query,
        "mode": mode,
        "answer": response.choices[0].message.content,
        "context": context,
        "sources": sources,
    }

print("Sherlock Holmes RAG chatbot ready!")

## 10.3 Compare RAG Modes: No-RAG vs BM25 vs Dense vs Dense+KG

Let's ask the chatbot the same questions using different retrieval modes and see how grounding affects answer quality:

In [None]:
# Compare the four RAG modes on Sherlock Holmes questions

sherlock_test_queries = [
    "Who is Irene Adler and what is her role in the story?",
    "What is the significance of 221B Baker Street?",
    "How does Holmes solve the mystery of the Red-Headed League?",
    "What is the relationship between Holmes and Watson?",
]

modes = ["none", "bm25", "dense", "dense+kg"]

for query in sherlock_test_queries:
    print(f"\n{'='*70}")
    print(f"QUERY: \"{query}\"\n")
    
    for mode in modes:
        result = sherlock_rag(query, mode=mode)
        answer_preview = result["answer"][:250]
        n_sources = len(result["sources"])
        print(f"  [{mode.upper():10s}] ({n_sources} passages)")
        print(f"  {answer_preview}...")
        print()

**Observation:** Notice how each mode differs:
- **No-RAG**: The LLM uses its general knowledge of Sherlock Holmes — may hallucinate details or mix up stories
- **BM25-RAG**: Finds passages with matching keywords — good for specific names and terms
- **Dense-RAG**: Finds semantically similar passages — better for paraphrased queries  
- **Dense+KG**: Adds structured entity relationships from the Knowledge Graph — provides extra factual anchoring

The more grounding we provide, the more specific and verifiable the answers tend to become, provided the retrieved passages are actually relevant to the question.
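
To make the Dense+KG mode more concrete, here is a minimal sketch of how entity mentions in a query can be turned into context lines. It uses a plain dictionary as a stand-in for the Knowledge Graph; the entities, their types, and the helper name `toy_kg_context` are made up for illustration and are not the Tutorial 07 artifacts. Unlike the substring test in `sherlock_kg_context`, this sketch matches whole words only, which is one possible way to avoid spurious hits on very short entity names.

```python
import re

# Hypothetical toy graph: entity -> (entity type, neighbours). Not the Tutorial 07 KG.
TOY_KG = {
    "Irene Adler": ("PERSON", ["Sherlock Holmes", "King of Bohemia"]),
    "Watson": ("PERSON", ["Sherlock Holmes", "Baker Street"]),
    "Baker Street": ("LOCATION", ["Sherlock Holmes", "Watson"]),
}

def toy_kg_context(query, kg, top_k_entities=3):
    """Return one context line per KG entity mentioned in the query."""
    lines = []
    for entity, (entity_type, neighbours) in kg.items():
        # Whole-word, case-insensitive match avoids hits inside longer words
        pattern = r"\b" + re.escape(entity) + r"\b"
        if re.search(pattern, query, flags=re.IGNORECASE):
            lines.append(
                f"KG: {entity} ({entity_type}) is connected to: {', '.join(neighbours)}"
            )
    return "\n".join(lines[:top_k_entities])

kg_ctx = toy_kg_context("Who is Irene Adler?", TOY_KG)
print(kg_ctx)
```

A query that mentions no known entity yields an empty string, in which case the chatbot would fall back to the document passages alone.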

---
# 11. RAGAS Evaluation on Your Corpus

Now we evaluate the Sherlock Holmes chatbot using the RAGAS metrics from Section 8. We use the QA test set and atomic facts generated in Tutorial 07.

For each QA pair, we:
1. **Retrieve** context using the dense pipeline
2. **Generate** a RAG-grounded answer
3. **Measure** faithfulness, answer relevancy, and whether the ground-truth chunk was retrieved (a simple retrieval-quality check)

This gives us a quantitative view of how well our RAG pipeline performs on real data — and where it hallucinates.
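
The measurement step can be sketched without any model calls. The example below is only a rough illustration: it splits an answer into sentence-level claims and counts a claim as supported when enough of its words occur in the context. Word overlap stands in for the embedding similarity used by `faithfulness_score` in Section 8, and the function name, the tokenizer, and the 0.6 threshold are assumptions chosen for the toy data.

```python
import re

def toy_faithfulness(answer, context, threshold=0.6):
    """Fraction of answer sentences whose words mostly appear in the context."""
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    # Treat each sentence as one claim (a simplification of claim extraction)
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    details = []
    for claim in claims:
        words = set(re.findall(r"[a-z0-9]+", claim.lower()))
        overlap = len(words & context_words) / len(words) if words else 0.0
        details.append({"claim": claim, "overlap": overlap, "supported": overlap >= threshold})
    score = sum(d["supported"] for d in details) / len(details) if details else 0.0
    return score, details

context = "Holmes lived at 221B Baker Street in London with Dr. Watson."
answer = "Holmes lived at 221B Baker Street. He kept a pet tiger in Paris."
score, details = toy_faithfulness(answer, context)
print(f"faithfulness = {score:.2f}")
for d in details:
    print(f"  supported={d['supported']}  overlap={d['overlap']:.2f}  {d['claim']}")
```

The first claim is fully covered by the context while the second is not, so half of the claims count as supported. Lexical overlap misses paraphrases, which is why an embedding-based comparison is the better choice for real evaluation.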

In [None]:
# RAGAS evaluation of the Sherlock Holmes RAG chatbot
# We reuse the faithfulness_score and answer_relevancy_score functions from Section 8.
# Retrieval quality is tracked as a ground-truth hit rather than via context_precision_score.

def evaluate_sherlock_rag(qa_pairs, mode="dense", max_queries=20):
    """Run the RAG chatbot on QA pairs and evaluate with RAGAS metrics."""
    results = []
    
    for i, qa in enumerate(qa_pairs[:max_queries]):
        question = qa["question"]
        ground_truth = qa["ground_truth_answer"]
        gt_chunk_idx = qa.get("ground_truth_chunk_idx", None)
        
        # Get RAG answer
        rag_result = sherlock_rag(question, mode=mode, top_k=5)
        answer = rag_result["answer"]
        
        # Build context string from retrieved passages
        context_text = rag_result["context"]
        
        # Compute RAGAS metrics
        faith, claim_details = faithfulness_score(answer, context_text, embedding_model, threshold=0.45)
        relevancy = answer_relevancy_score(question, answer, embedding_model)
        
        # Ground-truth hit: was the ground-truth chunk among the retrieved passages?
        retrieved_indices = [idx for idx, _ in rag_result["sources"]]
        gt_retrieved = 1.0 if gt_chunk_idx in retrieved_indices else 0.0
        
        results.append({
            "question": question,
            "answer": answer[:200],
            "ground_truth": ground_truth[:200],
            "faithfulness": faith,
            "answer_relevancy": relevancy,
            "gt_retrieved": gt_retrieved,
            "n_unsupported": len([c for c in claim_details if not c["supported"]]),
            "n_claims": len(claim_details),
        })
        
        if (i + 1) % 5 == 0:
            print(f"  Evaluated {i + 1}/{min(len(qa_pairs), max_queries)} queries...")
    
    return results

# Run evaluation
if sherlock_qa:
    print(f"Evaluating {min(len(sherlock_qa), 20)} QA pairs with Dense-RAG mode...\n")
    eval_results = evaluate_sherlock_rag(sherlock_qa, mode="dense", max_queries=20)
    
    # Summary statistics
    avg_faith = np.mean([r["faithfulness"] for r in eval_results])
    avg_relev = np.mean([r["answer_relevancy"] for r in eval_results])
    avg_gt_ret = np.mean([r["gt_retrieved"] for r in eval_results])
    
    print(f"\n{'='*60}")
    print(f"RAGAS Evaluation Summary ({len(eval_results)} queries, mode=dense)")
    print(f"{'='*60}")
    print(f"  Average Faithfulness:      {avg_faith:.4f}")
    print(f"  Average Answer Relevancy:  {avg_relev:.4f}")
    print(f"  Ground-Truth Retrieval:    {avg_gt_ret:.2%}")
    print()
    
    # Show individual results
    print(f"{'Question':<50s} {'Faith':>6s} {'Relev':>6s} {'GT?':>4s}")
    print("-" * 70)
    for r in eval_results:
        q_short = r["question"][:48]
        gt_mark = "✅" if r["gt_retrieved"] else "❌"
        print(f"{q_short:<50s} {r['faithfulness']:>6.2f} {r['answer_relevancy']:>6.2f} {gt_mark:>4s}")
else:
    print("No QA test set available. Please run Tutorial 07 Sections 7-9 first.")

## 11.1 Compare RAG Modes on RAGAS Metrics

Let's compare No-RAG, BM25-RAG, and Dense-RAG quantitatively. We expect Dense-RAG to have the highest faithfulness (least hallucination) because its answers are grounded in semantically relevant passages. Note that No-RAG retrieves nothing, so its ground-truth retrieval rate is 0 by construction.

In [None]:
# Compare RAG modes on RAGAS metrics
# We evaluate a smaller subset to keep API costs low

EVAL_SUBSET = 10  # Number of QA pairs per mode

if sherlock_qa and len(sherlock_qa) >= EVAL_SUBSET:
    mode_results = {}
    
    for mode in ["none", "bm25", "dense"]:
        print(f"Evaluating mode: {mode}...")
        mode_results[mode] = evaluate_sherlock_rag(
            sherlock_qa, mode=mode, max_queries=EVAL_SUBSET
        )
    
    # Build comparison table
    print(f"\n{'='*65}")
    print(f"RAGAS Comparison Across RAG Modes ({EVAL_SUBSET} queries each)")
    print(f"{'='*65}")
    print(f"{'Mode':<12s} {'Faithfulness':>13s} {'Relevancy':>10s} {'GT Retrieved':>13s}")
    print(f"{'-'*48}")
    
    for mode, results in mode_results.items():
        avg_f = np.mean([r["faithfulness"] for r in results])
        avg_r = np.mean([r["answer_relevancy"] for r in results])
        avg_g = np.mean([r["gt_retrieved"] for r in results])
        print(f"{mode.upper():<12s} {avg_f:>13.4f} {avg_r:>10.4f} {avg_g:>12.2%}")
    
    # Visualize comparison
    import matplotlib.pyplot as plt
    
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    modes_list = list(mode_results.keys())
    colors = ["#d62728", "#ff7f0e", "#2ca02c"]
    
    for ax_idx, (metric, label) in enumerate([
        ("faithfulness", "Faithfulness"),
        ("answer_relevancy", "Answer Relevancy"),
        ("gt_retrieved", "Ground-Truth Retrieved"),
    ]):
        values = [np.mean([r[metric] for r in mode_results[m]]) for m in modes_list]
        bars = axes[ax_idx].bar(
            [m.upper() for m in modes_list], values,
            color=colors, alpha=0.8, edgecolor="black"
        )
        axes[ax_idx].set_title(label, fontsize=13, fontweight="bold")
        axes[ax_idx].set_ylim(0, 1.05)
        axes[ax_idx].axhline(y=0.5, color="gray", linestyle="--", alpha=0.3)
        
        # Add value labels on bars
        for bar, val in zip(bars, values):
            axes[ax_idx].text(
                bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.02,
                f"{val:.2f}", ha="center", va="bottom", fontsize=11, fontweight="bold"
            )
    
    fig.suptitle("RAG Mode Comparison — RAGAS Metrics on Sherlock Holmes", fontsize=14, fontweight="bold")
    plt.tight_layout()
    plt.show()
else:
    print("Not enough QA pairs for comparison. Run Tutorial 07 Sections 7-9 first.")

**Key Takeaways:**

- **No-RAG** typically has **lower faithfulness** because the LLM generates answers from its training data rather than from our corpus. In this setup its "context" is only a placeholder string, so its claims have no passages to be checked against
- **BM25-RAG** improves faithfulness by providing keyword-matched context, but may miss semantically relevant passages that use different wording
- **Dense-RAG** usually achieves the **highest faithfulness** because semantic search, followed here by cross-encoder re-ranking, tends to surface the most relevant passages, reducing the LLM's tendency to hallucinate
- **Ground-truth retrieval rate** tells us how often the correct source chunk actually appears in the retrieved context. With one ground-truth chunk per question this is a hit rate (recall@k), which is closely related to the **context recall** metric from RAGAS

This demonstrates the fundamental value proposition of RAG: **external retrieval grounds the LLM's output in verifiable evidence**.
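
The ground-truth retrieval rate discussed above is easy to reproduce by hand. The sketch below assumes each question has exactly one ground-truth chunk index and uses made-up retrieval results; `hit_rate_at_k` is an illustrative helper, not a function from this notebook or from the RAGAS library.

```python
def hit_rate_at_k(retrieved_per_query, gt_indices, k=5):
    """Share of queries whose ground-truth chunk appears in the top-k results."""
    hits = [
        1.0 if gt is not None and gt in retrieved[:k] else 0.0
        for retrieved, gt in zip(retrieved_per_query, gt_indices)
    ]
    return sum(hits) / len(hits) if hits else 0.0

# Hypothetical retrieved chunk indices for three questions (best match first)
retrieved_per_query = [[12, 7, 3], [5, 44, 9], []]  # empty list = nothing retrieved (No-RAG)
gt_indices = [7, 8, 2]

rate = hit_rate_at_k(retrieved_per_query, gt_indices, k=3)
print(f"Ground-truth retrieval rate: {rate:.2%}")
```

Only the first question finds its ground-truth chunk, so the rate is one third. A low rate points to a retrieval problem rather than a generation problem, which helps decide where to debug the pipeline first.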

---
# Summary

| Concept | What We Learned |
|---|---|
| **Challenges** | LLMs hallucinate, have stale knowledge, and lack built-in fact-checking; RAG mitigates these by grounding answers in retrieved evidence |
| **RAG Pipeline** | Retrieve → Augment → Generate — the standard architecture for grounded AI |
| **Prompt-Level RAG** | Inject retrieved text into prompts (early fusion) — simplest and most common |
| **Vector-Level RAG** | Expand query vectors with retrieved doc embeddings — captures deeper semantics |
| **Knowledge Graphs** | Structured facts via SPARQL, KG embeddings (TransE, GraphSAGE) — factual anchoring |
| **Late Fusion** | Cross-attention during decoding (FiD) — dynamic information fusion |
| **Memory-Augmented** | kNN-LM interpolates LM output with external memory lookups |
| **Hallucination Detection** | RAGAS faithfulness metric: fraction of claims supported by context |
| **Trade-offs** | RAG adds latency and is limited by retrieval quality, but provides factual grounding and citability |
| **Applied RAG Chatbot** | Built a real chatbot on Sherlock Holmes using artifacts from Tutorials 03 & 07 |
| **RAGAS on Real Data** | Compared No-RAG, BM25-RAG, and Dense-RAG; Dense-RAG is expected to show the highest faithfulness |

### Key RAG Methods Hierarchy

```
RAG Methods
├── Prompt Level (Early Fusion) ← Simple, most common
│   └── Text concatenation, templates, selective injection
├── Vector Level (Embedding Fusion)
│   └── Query expansion, weighted averaging, learned MLP
├── KG-Based
│   └── TransE embeddings, GraphSAGE, SPARQL + LLM
├── Late Fusion
│   └── Cross-attention (FiD), re-ranking
├── Memory-Augmented
│   └── kNN-LM, RETRO
└── Dynamic Retrieval
    └── Re-retrieve during decoding (DCR/REPLUG)
```

### Cross-Tutorial Pipeline

```
Tutorial 03: Text → Chunks → Lucene Index → BM25 Search
                 ↓
Tutorial 07: Chunks → Knowledge Graph + FAISS Vector Store + Atomic Facts + QA Test Set
                 ↓
Tutorial 11: All artifacts → RAG Chatbot → RAGAS Evaluation → Hallucination Detection
```

### What's Next?

- **Tutorial 12:** Agentic Approaches — agents with memory, tools, guardrails, multi-agent orchestration

---
# Exercises

The following exercises are graded. Please provide your answers in the designated cells below.

## Exercise 1 — RAG Fusion Methods Comparison (5 points)

Compare **Prompt-Level RAG** (Early Fusion) and **Vector-Level RAG** (Embedding Fusion) as strategies for grounding LLM outputs. In your answer, address:

1. How does each method inject retrieved knowledge into the generation process?
2. What is the "alignment problem" in vector-level RAG and why does it matter?
3. In what scenario would you choose prompt-level RAG over vector-level RAG, and vice versa?

Write your answer in the cell below (minimum 150 words).

YOUR ANSWER HERE

## Exercise 2 — Hallucination Detection with RAGAS (5 points)

A RAG pipeline retrieves the following context about FAISS:

> "FAISS (Facebook AI Similarity Search) is an open-source library for efficient similarity search. It supports various index types including flat, IVF, and HNSW for approximate nearest neighbor search. FAISS reduces retrieval complexity from O(N) to O(log N)."

The LLM generates this answer:

> "FAISS is a closed-source vector database developed by Google. It uses transformer-based indexing to achieve O(1) constant-time retrieval. FAISS is the only library that supports approximate nearest neighbor search, and it is primarily used for image classification tasks."

Using the RAGAS faithfulness metric:

1. Extract each factual claim from the generated answer.
2. For each claim, determine whether it is supported, contradicted, or absent from the context.
3. Calculate the faithfulness score.
4. Explain what this score tells us about the answer quality.

Write your answer in the cell below (minimum 150 words).

YOUR ANSWER HERE

## Exercise 3 — Implementing Retrieval Evaluation (10 points)

Write code that evaluates the **retrieval quality** of three methods (BM25, Dense, Hybrid) on our knowledge base. Your code should:

1. Use the retriever functions defined earlier (`retrieve_bm25`, `retrieve_dense`, `retrieve_hybrid`)
2. For each query below, compute the **top-3** results from each retrieval method
3. Compute **Mean Reciprocal Rank (MRR)** and **Precision@3** for each method
4. Store results in: `mrr_bm25`, `mrr_dense`, `mrr_hybrid` (floats) and `p3_bm25`, `p3_dense`, `p3_hybrid` (floats)

Use these query-answer pairs (value = index of the expected best document in `knowledge_base`):
```python
eval_queries = {
    "How does BM25 rank documents?": 0,                  # BM25 Retrieval
    "What is retrieval augmented generation?": 2,         # RAG Architecture
    "How do you evaluate RAG for hallucination?": 5,      # RAGAS Evaluation
    "graph neural networks for knowledge embedding": 11,  # GraphSAGE
    "memory augmented language model": 10,                # kNN-LM
}
```

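**Precision@3** = fraction of queries where the expected document appears in the top-3 results. Because each query has exactly one expected document, this is strictly a hit rate (Success@3); we keep the name Precision@3 to match the variable names above.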

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE
raise NotImplementedError("Replace this line with your solution")

In [None]:
# Autograder test cell — do not modify
assert 'mrr_bm25' in dir(), "You need to define 'mrr_bm25'"
assert 'mrr_dense' in dir(), "You need to define 'mrr_dense'"
assert 'mrr_hybrid' in dir(), "You need to define 'mrr_hybrid'"
assert 'p3_bm25' in dir(), "You need to define 'p3_bm25'"
assert 'p3_dense' in dir(), "You need to define 'p3_dense'"
assert 'p3_hybrid' in dir(), "You need to define 'p3_hybrid'"
assert isinstance(mrr_bm25, float), "mrr_bm25 should be a float"
assert isinstance(mrr_dense, float), "mrr_dense should be a float"
assert isinstance(mrr_hybrid, float), "mrr_hybrid should be a float"
assert isinstance(p3_bm25, float), "p3_bm25 should be a float"
assert isinstance(p3_dense, float), "p3_dense should be a float"
assert isinstance(p3_hybrid, float), "p3_hybrid should be a float"
assert 0 <= mrr_bm25 <= 1 and 0 <= mrr_dense <= 1 and 0 <= mrr_hybrid <= 1
assert 0 <= p3_bm25 <= 1 and 0 <= p3_dense <= 1 and 0 <= p3_hybrid <= 1
print(f"{'Method':<12} {'MRR':>8} {'P@3':>8}")
print("-" * 30)
print(f"{'BM25':<12} {mrr_bm25:>8.4f} {p3_bm25:>8.2f}")
print(f"{'Dense':<12} {mrr_dense:>8.4f} {p3_dense:>8.2f}")
print(f"{'Hybrid':<12} {mrr_hybrid:>8.4f} {p3_hybrid:>8.2f}")
print(f"\nAll auto-graded tests passed!")

## Exercise 4 — Build & Evaluate a RAG Chatbot on Your Own Corpus (15 points)

In Tutorials 03 and 07 (Exercise 4), you built a search engine, Knowledge Graph, FAISS index, and QA test set over your own text corpus. Now build a **RAG chatbot** and evaluate it with RAGAS metrics.

Your task:
1. **Load your corpus data** — chunks, FAISS index, KG, and QA test set from your custom corpus (the data you created in Tutorials 03 and 07). *(2 pts)*
2. **Build a RAG chatbot** — implement at least two modes: No-RAG and Dense-RAG. Use the `sherlock_rag()` function as a template. *(4 pts)*
3. **Generate RAG answers** for at least 10 questions from your QA test set using both modes. *(3 pts)*
4. **Evaluate with RAGAS** — compute faithfulness and answer relevancy for each mode. Store the average scores in `my_faith_norag`, `my_faith_dense`, `my_relev_norag`, `my_relev_dense` (floats). *(4 pts)*
5. **Analyze** (in a markdown cell, minimum 100 words): What patterns do you observe? Does Dense-RAG always outperform No-RAG on faithfulness? Where does the pipeline fail? *(2 pts)*

YOUR ANSWER HERE

In [None]:
# YOUR CODE HERE — Exercise 4
# Follow the steps outlined above.
# Store your results in: my_faith_norag, my_faith_dense, my_relev_norag, my_relev_dense

raise NotImplementedError("Replace this line with your solution")

YOUR ANALYSIS HERE (minimum 100 words on RAG vs No-RAG faithfulness patterns, pipeline failures)

In [None]:
# Autograder test cell — do not modify
assert 'my_faith_norag' in dir(), "You need to define 'my_faith_norag'"
assert 'my_faith_dense' in dir(), "You need to define 'my_faith_dense'"
assert 'my_relev_norag' in dir(), "You need to define 'my_relev_norag'"
assert 'my_relev_dense' in dir(), "You need to define 'my_relev_dense'"
assert isinstance(my_faith_norag, float), "my_faith_norag should be a float"
assert isinstance(my_faith_dense, float), "my_faith_dense should be a float"
assert isinstance(my_relev_norag, float), "my_relev_norag should be a float"
assert isinstance(my_relev_dense, float), "my_relev_dense should be a float"
assert 0 <= my_faith_norag <= 1 and 0 <= my_faith_dense <= 1
assert 0 <= my_relev_norag <= 1 and 0 <= my_relev_dense <= 1
print(f"{'Mode':<12s} {'Faithfulness':>13s} {'Relevancy':>10s}")
print("-" * 37)
print(f"{'No-RAG':<12s} {my_faith_norag:>13.4f} {my_relev_norag:>10.4f}")
print(f"{'Dense-RAG':<12s} {my_faith_dense:>13.4f} {my_relev_dense:>10.4f}")
print(f"\nAll auto-graded tests passed!")