# Manual RAG Pipeline: Mechanisms First

This notebook builds a Retrieval-Augmented Generation (RAG) pipeline from scratch.
You'll see every step explicitly before we move to frameworks like LangChain.

**Works on:** Google Colab, Local Jupyter (Mac/Windows/Linux)

**Pipeline Overview:**
```
Documents → Chunking → Embedding → Index (FAISS)
                                        ↓
User Query → Embed Query → Similarity Search → Top-K Chunks
                                                    ↓
                                        Prompt Assembly → LLM → Answer
```

## TODO: Topic 5 RAG Course Project Checklist

- **Exercise 0:** Set-up - Get notebook running; unzip Corpora.zip. Use PDFs from `Corpora/<corpus>/pdf_embedded/`.
- **Exercise 1:** Open model RAG vs no RAG - Compare Qwen 2.5 1.5B with/without RAG on Model T manual and Congressional Record.
- **Exercise 2:** Open model + RAG vs large model - Run GPT-4o Mini with no tools on same queries.
- **Exercise 3:** Open model + RAG vs frontier chat - Compare local Qwen+RAG vs GPT-4/Claude (web).
- **Exercise 4:** Effect of top-K - Test k = 1, 3, 5, 10, 20.
- **Exercise 5:** Unanswerable questions - Off-topic, related-but-missing, false premise.
- **Exercise 6:** Query phrasing sensitivity - Same question in 5+ phrasings.
- **Exercise 7:** Chunk overlap - Re-chunk with overlap 0, 64, 128, 256.
- **Exercise 8:** Chunk size - Chunk at 128, 256, 512, 1024, 2048.
- **Exercise 9:** Retrieval score analysis - 10 queries, top-10 chunks, score distribution.
- **Exercise 10:** Prompt template variations - Minimal, strict grounding, citation, permissive, structured.
- **Exercise 11:** Failure mode catalog - Computation, temporal, comparison, ambiguous, multi-hop, etc.
- **Exercise 12:** Cross-document synthesis - Questions needing multiple chunks.

## Setup

First, let's install the required packages and detect our compute environment.

In [None]:
# Install dependencies
# On Colab, these install quickly. Locally, you may already have them.
!pip install -q torch transformers sentence-transformers faiss-cpu pymupdf accelerate ipyfilechooser
# For Exercise 2 (GPT-4o Mini): !pip install -q openai


In [None]:
# =============================================================================
# ENVIRONMENT AND DEVICE DETECTION
# =============================================================================
import os
import sys

# Enable MPS fallback for any PyTorch operations not yet implemented on Metal
# This MUST be set before importing torch
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

import torch
from typing import Tuple

def detect_environment() -> str:
    """Detect if we're running on Colab or locally."""
    try:
        import google.colab
        return 'colab'
    except ImportError:
        return 'local'

def get_device() -> Tuple[str, torch.dtype]:
    """
    Detect the best available compute device.

    Priority: CUDA > MPS (Apple Silicon) > CPU

    Returns:
        Tuple of (device_string, recommended_dtype)

    Notes:
        - CUDA: Use float16 for memory efficiency (Tensor Cores optimize this)
        - MPS: Use float32 - Apple Silicon doesn't have the same float16
               optimizations as NVIDIA, and float32 is often faster
        - CPU: Use float32 (float16 not well supported on CPU)
    """
    if torch.cuda.is_available():
        device = 'cuda'
        dtype = torch.float16
        device_name = torch.cuda.get_device_name(0)
        memory_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"✓ Using CUDA GPU: {device_name} ({memory_gb:.1f} GB)")

    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        device = 'mps'
        dtype = torch.float32  # float32 is often faster on Apple Silicon!
        print("✓ Using Apple Silicon GPU (MPS)")
        print("  Note: Using float32 (faster than float16 on Apple Silicon)")

    else:
        device = 'cpu'
        dtype = torch.float32
        print("⚠ Using CPU (no GPU detected)")
        print("  Tip: For faster processing, use a machine with a GPU")

    return device, dtype

# Detect environment and device
ENVIRONMENT = detect_environment()
DEVICE, DTYPE = get_device()

print(f"\nEnvironment: {ENVIRONMENT.upper()}")
print(f"Device: {DEVICE}, Dtype: {DTYPE}")

✓ Using CUDA GPU: NVIDIA A100-SXM4-40GB (42.4 GB)

Environment: COLAB
Device: cuda, Dtype: torch.float16


## Load Your Documents

**Cell 1:** Configure your document source and select/upload files
- **Local Jupyter**: Use the folder picker, then run Cell 2
- **Colab + Upload**: Files upload immediately (blocking), then run Cell 2
- **Colab + Drive**: Set `USE_GOOGLE_DRIVE = True`, mounts Drive and shows picker, then run Cell 2

**Cell 2:** Confirms the selection and checks that the folder exists (file listing is skipped for speed)

In [None]:
# =============================================================================
# CELL 1: SELECT DOCUMENT SOURCE
# =============================================================================
# This cell either:
#   - Shows a folder picker (Local or Colab+Drive) - NON-BLOCKING
#   - Shows an upload dialog (Colab+Upload) - BLOCKING
#
# If a folder picker is shown, SELECT YOUR FOLDER BEFORE running Cell 2.
# The picker widget is non-blocking, so the code continues before you select.
# =============================================================================
# Example Drive paths: /content/drive/MyDrive/Corpora/Corpora/ModelTService/txt
#                      /content/drive/MyDrive/Corpora/Corpora/Congressional_Record_Jan_2026/txt
from pathlib import Path

# ------------- COLAB USERS: CONFIGURE HERE -------------
USE_GOOGLE_DRIVE = True  # Set to True to use Google Drive instead of uploading
# -------------------------------------------------------

# Default folder: use Corpora from course project (unzip Corpora.zip first).
_folder_default = Path("Corpora/Corpora/ModelTService/txt")
DOC_FOLDER = str(_folder_default) if _folder_default.exists() else "documents"
folder_chooser = None  # Will hold the picker widget if used

if ENVIRONMENT == 'colab':
    if USE_GOOGLE_DRIVE:
        # ----- COLAB + GOOGLE DRIVE -----
        # Mount Drive first, then show folder picker
        from google.colab import drive
        print("Mounting Google Drive...")
        drive.mount('/content/drive')
        print("✓ Google Drive mounted\n")

        # Now show folder picker for the Drive
        try:
            from ipyfilechooser import FileChooser

            folder_chooser = FileChooser(
                path='/content/drive/MyDrive/Corpora/Corpora/ModelTService/txt',
                title='Select your documents folder in Google Drive',
                show_only_dirs=True,
                select_default=True
            )
            print("📁 Select your documents folder below, then run Cell 2:")
            print("   (The picker is non-blocking - select BEFORE running the next cell)")
            display(folder_chooser)

        except ImportError:
            # Fallback: manual path entry
            print("Folder picker not available.")
            print("Edit DOC_FOLDER below with your Google Drive path, then run Cell 2:")
            DOC_FOLDER = '/content/drive/MyDrive/your_documents_folder'  # ← Edit this!
            print(f"  DOC_FOLDER = '{DOC_FOLDER}'")
    else:
        # ----- COLAB + UPLOAD -----
        # Upload dialog blocks until complete, so DOC_FOLDER is ready when done
        from google.colab import files
        os.makedirs(DOC_FOLDER, exist_ok=True)

        print("Upload your documents (PDF, TXT, or MD):")
        print("(This dialog blocks until upload is complete)\n")
        uploaded = files.upload()

        for filename in uploaded.keys():
            os.rename(filename, f'{DOC_FOLDER}/{filename}')
            print(f"  ✓ Saved: {DOC_FOLDER}/{filename}")

        print("\n✓ Upload complete. Run Cell 2 to continue.")

else:
    # ----- LOCAL JUPYTER -----
    # Show folder picker
    print("Running locally\n")

    try:
        from ipyfilechooser import FileChooser

        folder_chooser = FileChooser(
            path=str(Path.home()),
            title='Select your documents folder',
            show_only_dirs=True,
            select_default=True
        )
        print("📁 Select your documents folder below, then run Cell 2:")
        print("   (The picker is non-blocking - select BEFORE running the next cell)")
        display(folder_chooser)

    except ImportError:
        # Fallback: manual path entry
        print("Folder picker not available (ipyfilechooser not installed).")
        print(f"\nUsing default folder: {Path(DOC_FOLDER).absolute()}")
        print("\nTo use a different folder, edit DOC_FOLDER in this cell:")
        print("  DOC_FOLDER = '/path/to/your/documents'")
        os.makedirs(DOC_FOLDER, exist_ok=True)

Mounting Google Drive...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✓ Google Drive mounted

📁 Select your documents folder below, then run Cell 2:
   (The picker is non-blocking - select BEFORE running the next cell)


FileChooser(path='/content/drive/MyDrive/Corpora/Corpora/ModelTService/txt', filename='', title='Select your d…

In [None]:
# =============================================================================
# CELL 2: CONFIRM SELECTION AND LIST DOCUMENTS
# =============================================================================
# If you used a folder picker above, make sure you selected a folder
# BEFORE running this cell. The picker is non-blocking.
# =============================================================================

# Read selection from folder picker (if one was used)
if folder_chooser is not None and folder_chooser.selected_path:
    DOC_FOLDER = folder_chooser.selected_path
    print(f"✓ Using selected folder: {DOC_FOLDER}")
elif folder_chooser is not None:
    print("⚠ No folder selected in picker!")
    print("  Please go back to Cell 1, select a folder, then run this cell again.")
else:
    # No picker used (upload or manual path)
    print(f"✓ Using folder: {DOC_FOLDER}")

# Confirm folder (listing skipped for speed)
doc_path = Path(DOC_FOLDER)
if doc_path.exists():
    print(f"✓ Folder set: {doc_path.absolute()}")
    print("  Run the next cells to load, chunk, and index documents.")
else:
    print(f"⚠ Folder not found: {DOC_FOLDER}")
    print("  Please set DOC_FOLDER in the previous cell and run it again.")

✓ Using selected folder: /content/drive/MyDrive/Corpora/Corpora/ModelTService/txt
✓ Folder set: /content/drive/MyDrive/Corpora/Corpora/ModelTService/txt
  Run the next cells to load, chunk, and index documents.


---
## Stage 1: Document Loading

We need to extract text from our documents. For PDFs with embedded text,
PyMuPDF (fitz) reads the text layer directly - no OCR needed.

**Corpora:** Use PDFs from `Corpora/<name>/pdf_embedded/`. The `.txt` files in `txt/` are for checking retrieval vs OCR issues.

In [None]:
# Exercise 1 (and reuse): Official query lists. Reference: CR Jan 13, 20, 21, 23, 2026.
QUERIES_MODEL_T = [
    "How do I adjust the carburetor on a Model T?",
    "What is the correct spark plug gap for a Model T Ford?",
    "How do I fix a slipping transmission band?",
    "What oil should I use in a Model T engine?",
]
QUERIES_CR = [
    "What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?",
    "What mistake did Elise Stefanik make in Congress on January 23, 2026?",
    "What is the purpose of the Main Street Parity Act?",
    "Who in Congress has spoken for and against funding of pregnancy centers?",
]

In [None]:
import fitz  # PyMuPDF
from typing import List, Tuple

def load_text_file(filepath: str) -> str:
    """Load a plain text file."""
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:
        return f.read()


def load_pdf_file(filepath: str) -> str:
    """
    Extract text from a PDF with embedded text.

    PyMuPDF reads the text layer directly.
    For scanned PDFs without embedded text, you'd need OCR.
    """
    doc = fitz.open(filepath)
    text_parts = []

    for page_num, page in enumerate(doc):
        text = page.get_text()
        if text.strip():
            # Add page marker for debugging/citation
            text_parts.append(f"\n[Page {page_num + 1}]\n{text}")

    doc.close()
    return "\n".join(text_parts)


def load_documents(doc_folder: str) -> List[Tuple[str, str]]:
    """Load all documents from a folder. Returns list of (filename, content)."""
    documents = []
    folder = Path(doc_folder)

    for filepath in folder.rglob("*"):
        try:
            if not filepath.is_file():
                continue
        except OSError:
            continue
        if filepath.suffix.lower() not in ('.pdf', '.txt', '.md', '.text'):
            continue
        try:
            if filepath.suffix.lower() == '.pdf':
                content = load_pdf_file(str(filepath))
            else:
                # Remaining allowed suffixes are plain text: .txt, .md, .text
                content = load_text_file(str(filepath))

            if content.strip():
                documents.append((filepath.name, content))
                print(f"✓ Loaded: {filepath.name} ({len(content):,} chars)")
        except Exception as e:
            print(f"✗ Error loading {filepath}: {e}")

    return documents

In [None]:
# Load your documents
documents = load_documents(DOC_FOLDER)
print(f"\nLoaded {len(documents)} documents")

if len(documents) == 0:
    print("\n⚠ No documents loaded! Please add PDF or TXT files to the documents folder.")

✓ Loaded: ModelT-41-50.txt (14,264 chars)
✓ Loaded: ModelT-51-60.txt (14,168 chars)
✓ Loaded: ModelT-21-30.txt (17,050 chars)
✓ Loaded: ModelT-11-20.txt (19,009 chars)
✓ Loaded: ModelT-01-10.txt (18,676 chars)
✓ Loaded: ModelT-31-40.txt (12,194 chars)
✓ Loaded: Ford-Model-T-Man-1919.txt (95,574 chars)
✓ Loaded: ModelT-61-62.txt (201 chars)

Loaded 8 documents


In [None]:
# Inspect a document to verify loading worked
if documents:
    filename, content = documents[0]
    print(f"First document: {filename}")
    print(f"Total length: {len(content):,} characters")
    print(f"\nFirst 1000 characters:\n{'-'*40}")
    print(content[:1000])

First document: ModelT-41-50.txt
Total length: 14,264 characters

First 1000 characters:
----------------------------------------
What about the * ‚Äò+ ‚Äò
Wheels?

. pijabvloniie should, after packer i come to rest with the tire valve ety: be
the hub. Undue wear of the hub bearings is usually caused by eee oft 28 -
cation and excessive friction, due to the adjusting cone being screwed up ea
tight. It is a good plan to clean the bearings frequently and keep the -

; filled with grease.

How are'the
I heels removed? Answer Ne. 8

Front wheels: Take off hub remove cotter pin

wheel removed. Care should a ie taken to oct that the Kk ROB
replaced on the same spindle from which they. were removed, otherwise. thd
a liability of stripping the threads which are feft on the left spindle end rightop'

opposite as you stand facing the car. Rear wheels: They should not be remg
unless: absolutely necessary‚Äîin which case proceed as abdve, then wi hal woe
puller remove the wheel from the tapered sh

---
## Stage 2: Chunking

Documents need to be split into pieces small enough to be relevant but large enough to carry meaning.

**Why overlap?** If a key sentence sits right at a chunk boundary, splitting without overlap might cut it in half. With overlap, the next chunk starts a little before the previous one ends, so a passage no longer than the overlap still appears intact in at least one chunk.

**Experiment:** Try different chunk sizes (256, 512, 1024) and see how each affects retrieval!
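
To make the boundary argument concrete, here is a minimal sketch using plain fixed-size character windows. It deliberately leaves out the paragraph and sentence boundary logic of the chunker used in this notebook; the helper name `fixed_windows` and the toy text are assumptions made for illustration only.

```python
from typing import List


def fixed_windows(text: str, size: int, overlap: int) -> List[str]:
    """Split text into fixed-size character windows; neighbors share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # how far the window start advances each step
    return [text[i:i + size] for i in range(0, len(text), stride)]


# Toy text: the key sentence (14 chars) straddles the first 40-character boundary.
key = "Use light oil."
text = "x" * 34 + key + "y" * 40

no_overlap = fixed_windows(text, size=40, overlap=0)
with_overlap = fixed_windows(text, size=40, overlap=16)

print(any(key in c for c in no_overlap))    # False: the sentence is cut at position 40
print(any(key in c for c in with_overlap))  # True: the second window holds it whole
```

The overlap (16) is at least as long as the sentence (14), which is why the second window is guaranteed to contain it.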

In [None]:
from dataclasses import dataclass

@dataclass
class Chunk:
    """A chunk of text with metadata for tracing back to source."""
    text: str
    source_file: str
    chunk_index: int
    start_char: int
    end_char: int


def chunk_text(
    text: str,
    source_file: str,
    chunk_size: int = 512,
    chunk_overlap: int = 128
) -> List[Chunk]:
    """
    Split text into overlapping chunks.

    We try to break at sentence or paragraph boundaries
    to avoid cutting mid-thought.
    """
    chunks = []
    start = 0
    chunk_index = 0

    while start < len(text):
        end = start + chunk_size

        # Try to break at a good boundary
        if end < len(text):
            # Look for paragraph break first
            para_break = text.rfind('\n\n', start + chunk_size // 2, end)
            if para_break != -1:
                end = para_break + 2
            else:
                # Look for sentence break
                sentence_break = text.rfind('. ', start + chunk_size // 2, end)
                if sentence_break != -1:
                    end = sentence_break + 2

        chunk_text_str = text[start:end].strip()

        if chunk_text_str:
            chunks.append(Chunk(
                text=chunk_text_str,
                source_file=source_file,
                chunk_index=chunk_index,
                start_char=start,
                end_char=min(end, len(text))  # clamp: the last window can overrun the text
            ))
            chunk_index += 1

        # Move forward, accounting for overlap
        start = end - chunk_overlap
        if chunks and start <= chunks[-1].start_char:
            start = end  # Safety: ensure progress

    return chunks

In [None]:
# ============================================
# EXPERIMENT: Try different chunk sizes!
# ============================================
CHUNK_SIZE = 512      # Try: 256, 512, 1024
CHUNK_OVERLAP = 128   # Try: 64, 128, 256
# For Ex 7/8 use rebuild_pipeline(); see the cell after the FAISS index.

# Chunk all documents
all_chunks = []
for filename, content in documents:
    doc_chunks = chunk_text(content, filename, CHUNK_SIZE, CHUNK_OVERLAP)
    all_chunks.extend(doc_chunks)
    print(f"{filename}: {len(doc_chunks)} chunks")

print(f"\nTotal: {len(all_chunks)} chunks")

ModelT-41-50.txt: 51 chunks
ModelT-51-60.txt: 46 chunks
ModelT-21-30.txt: 56 chunks
ModelT-11-20.txt: 66 chunks
ModelT-01-10.txt: 64 chunks
ModelT-31-40.txt: 44 chunks
Ford-Model-T-Man-1919.txt: 326 chunks
ModelT-61-62.txt: 1 chunks

Total: 654 chunks


In [None]:
# Inspect some chunks
if all_chunks:
    print("Sample chunks:")
    indices_to_show = [0, len(all_chunks)//2, -1] if len(all_chunks) > 2 else range(len(all_chunks))
    for i in indices_to_show:
        chunk = all_chunks[i]
        print(f"\n{'='*60}")
        print(f"Chunk {chunk.chunk_index} from {chunk.source_file}")
        print(f"{'='*60}")
        print(chunk.text[:300] + "..." if len(chunk.text) > 300 else chunk.text)

Sample chunks:

Chunk 0 from ModelT-41-50.txt
What about the * ‚Äò+ ‚Äò
Wheels?

. pijabvloniie should, after packer i come to rest with the tire valve ety: be
the hub. Undue wear of the hub bearings is usually caused by eee oft 28 -
cation and excessive friction, due to the adjusting cone being screwed up ea
tight. It is a good plan to clean the b...

Chunk 0 from Ford-Model-T-Man-1919.txt
For Owners and Operators

of Ford Cars and Trucks

Published by J
FORD MOTOR. COMPANY
Detroit, Michigan, U.S. A. .



Ss Foreword

' is a significant fact that nearly all Ford cara are driven by
laymen‚Äîby owners, who in the. great majority of cases have
FOr no practical experience with things mechan...

Chunk 0 from ModelT-61-62.txt
What It Is For
How to Disconnect

OPERATION
Alustments‚ÄîWhat to Do vere... cong

Model T Truck

‚ÄòWorm, How Removed rar bo eee ne ceeaee ba eeeeeces reer er rrr

The Ford Starting and Highting System


---
## Stage 3: Embedding

Embeddings map text to dense vectors where **semantic similarity = geometric proximity**.

A sentence about "cardiac arrest" and one about "heart attack" will have similar embeddings even though they share no words.

**Note:** Older releases of sentence-transformers did not auto-detect Apple MPS, so we pass the device explicitly. Doing so is harmless on newer releases.

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load embedding model
# Options:
# - "sentence-transformers/all-MiniLM-L6-v2": Fast, small (80MB), good quality
# - "BAAI/bge-small-en-v1.5": Better for retrieval, similar size

EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

print(f"Loading embedding model: {EMBEDDING_MODEL}")
print(f"Device: {DEVICE}")

# Pass the device explicitly (older sentence-transformers releases do not auto-detect MPS)
embed_model = SentenceTransformer(EMBEDDING_MODEL, device=DEVICE)
EMBEDDING_DIM = embed_model.get_sentence_embedding_dimension()
print(f"Embedding dimension: {EMBEDDING_DIM}")

Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
Device: cuda


Embedding dimension: 384


In [None]:
# DEMO: See how embeddings capture semantic similarity
test_sentences = [
    "The engine needs regular oil changes.",
    "Motor oil should be replaced periodically.",
    "The Senate convened at noon.",
    "Congress began its session at midday."
]

test_embeddings = embed_model.encode(test_sentences)

# Compute cosine similarity matrix
from numpy.linalg import norm

def cosine_sim(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

print("Cosine similarity matrix:")
print("\n" + " " * 40 + "  [0]    [1]    [2]    [3]")
for i, s1 in enumerate(test_sentences):
    sims = [cosine_sim(test_embeddings[i], test_embeddings[j]) for j in range(4)]
    print(f"[{i}] {s1[:35]:35} {sims[0]:.3f}  {sims[1]:.3f}  {sims[2]:.3f}  {sims[3]:.3f}")

print("\n→ Notice: [0]-[1] are similar (both about oil), [2]-[3] are similar (both about Congress)")

Cosine similarity matrix:

                                          [0]    [1]    [2]    [3]
[0] The engine needs regular oil change 1.000  0.728  -0.045  -0.032
[1] Motor oil should be replaced period 0.728  1.000  0.014  0.035
[2] The Senate convened at noon.        -0.045  0.014  1.000  0.684
[3] Congress began its session at midda -0.032  0.035  0.684  1.000

→ Notice: [0]-[1] are similar (both about oil), [2]-[3] are similar (both about Congress)


In [None]:
# Embed all chunks - this may take a few minutes for large corpora
if all_chunks:
    print(f"Embedding {len(all_chunks)} chunks on {DEVICE}...")
    chunk_texts = [c.text for c in all_chunks]
    chunk_embeddings = embed_model.encode(chunk_texts, show_progress_bar=True)
    chunk_embeddings = chunk_embeddings.astype('float32')  # FAISS wants float32
    print(f"Embeddings shape: {chunk_embeddings.shape}")
else:
    print("No chunks to embed - please load documents first.")

Embedding 654 chunks on cuda...


Embeddings shape: (654, 384)


---
## Stage 4: Vector Index (FAISS)

FAISS efficiently finds nearest neighbors in high-dimensional spaces.

We use a simple **flat index** (brute-force search) which is transparent and works well for up to ~100k vectors. For larger corpora, you'd use approximate methods like IVF or HNSW.

**Note:** FAISS GPU support is CUDA-only. On MPS/CPU, we use faiss-cpu (still very fast for <100k vectors).
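
What a flat inner-product index computes can be sketched in a few lines of NumPy. This is a stand-in for illustration, not FAISS itself: the helper name `flat_ip_search` and the random toy vectors are assumptions. The point it shows is that, once vectors are scaled to unit length, the inner product equals cosine similarity, so brute-force search is one matrix product followed by a sort.

```python
import numpy as np


def flat_ip_search(index_vecs: np.ndarray, query: np.ndarray, top_k: int):
    """Brute-force top-k search by inner product over every stored vector."""
    scores = index_vecs @ query            # one inner product per stored vector
    order = np.argsort(-scores)[:top_k]    # positions of the k highest scores
    return scores[order], order


rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 8)).astype("float32")  # toy "chunk embeddings"
query = rng.normal(size=8).astype("float32")           # toy "query embedding"

# Cosine similarity computed directly from the raw vectors
cosine = (vectors @ query) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))

# Scale everything to unit length, then use plain inner products
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)
scores, ids = flat_ip_search(unit_vectors, unit_query, top_k=3)

print(ids, scores)
print(np.allclose(scores, cosine[ids], atol=1e-5))  # scores match the cosine values
```

The cost grows linearly with the number of stored vectors, which is why approximate indexes become attractive for much larger corpora.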

In [None]:
import faiss

# Create FAISS index
# IndexFlatIP = Inner Product (for cosine similarity on normalized vectors)
index = faiss.IndexFlatIP(EMBEDDING_DIM)

if all_chunks:
    # Normalize vectors so inner product = cosine similarity
    faiss.normalize_L2(chunk_embeddings)

    # Add vectors to index
    index.add(chunk_embeddings)
    print(f"Index built with {index.ntotal} vectors")
else:
    print("No embeddings to index - please load and embed documents first.")

Index built with 654 vectors


---
## Stage 5: Retrieval

Now we can search! Given a query, we:
1. Embed the query with the same model
2. Find the top-k most similar chunks
3. Return those chunks as context

In [None]:
# Helper for Exercises 7 & 8: rebuild chunks + index with different chunk_size / chunk_overlap.
def rebuild_pipeline(chunk_size: int = 512, chunk_overlap: int = 128):
    """Re-chunk documents, re-embed, and rebuild FAISS index. Updates global all_chunks and index."""
    global all_chunks, index
    all_chunks = []
    for filename, content in documents:
        all_chunks.extend(chunk_text(content, filename, chunk_size=chunk_size, chunk_overlap=chunk_overlap))
    chunk_embeddings = embed_model.encode([c.text for c in all_chunks], show_progress_bar=True).astype("float32")
    faiss.normalize_L2(chunk_embeddings)
    index = faiss.IndexFlatIP(EMBEDDING_DIM)
    index.add(chunk_embeddings)
    print(f"Rebuilt: {len(all_chunks)} chunks, chunk_size={chunk_size}, chunk_overlap={chunk_overlap}")

In [None]:
def retrieve(query: str, top_k: int = 5):
    """
    Retrieve the top-k most relevant chunks for a query.

    Returns: List of (chunk, similarity_score) tuples
    """
    # Embed the query
    query_embedding = embed_model.encode([query]).astype('float32')
    faiss.normalize_L2(query_embedding)

    # Search
    scores, indices = index.search(query_embedding, top_k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx != -1:
            results.append((all_chunks[idx], float(score)))

    return results

In [None]:
# Test retrieval
# ============================================
# TRY DIFFERENT QUERIES FOR YOUR CORPUS!
# ============================================
test_query = "What oil should I use in a Model T engine?" # ← Modify this!

if index.ntotal > 0:
    results = retrieve(test_query, top_k=5)

    print(f"Query: {test_query}\n")
    print("Top 5 retrieved chunks:")
    for i, (chunk, score) in enumerate(results, 1):
        print(f"\n[{i}] Score: {score:.4f} | Source: {chunk.source_file}")
        print(f"    {chunk.text[:200]}...")
else:
    print("Index is empty - please load, chunk, and embed documents first.")

Query: What oil should I use in a Model T engine?

Top 5 retrieved chunks:

[1] Score: 0.3775 | Source: Ford-Model-T-Man-1919.txt
    cting Rad
Cam Shaft Rear Bearing
Crank Salt Rear Bearing

Sectional View of the Model T Motor.

(Cut No. 2)

.
;

Terminal

ie

| Sree Cm Rit
: Staring Crank Spring
Searing Crank Steeve

Camm Stadt Fr...

[2] Score: 0.3756 | Source: Ford-Model-T-Man-1919.txt
    casing (
reservoir which holds this oil) you will find two pet cocks. Pour off in sl√©wly -
until it runs out of the upper cock. Leave the cock open until it stopd running‚Äî
then close it. ‚ÄòAfter the en...

[3] Score: 0.3756 | Source: ModelT-01-10.txt
    casing (
reservoir which holds this oil) you will find two pet cocks. Pour off in sl√©wly -
until it runs out of the upper cock. Leave the cock open until it stopd running‚Äî
then close it. ‚ÄòAfter the en...

[4] Score: 0.3746 | Source: ModelT-01-10.txt
    cting Rad
Cam Shaft Rear Bearing
Crank Salt Rear Bearing

Sectional View of the Model

---
## Stage 6: Generation (LLM)

Now we load a local LLM to generate answers from the retrieved context.

**Recommended models:**
- `Qwen/Qwen2.5-1.5B-Instruct` - Strong instruction following for its size
- `Qwen/Qwen2.5-3B-Instruct` - Even better if you have 8GB+ VRAM
- `meta-llama/Llama-3.2-1B-Instruct` - Alternative, slightly weaker

**Device handling:**
- CUDA: Uses `device_map="auto"` and float16
- MPS: Loads to CPU first, then moves to MPS with float32
- CPU: Uses float32 (slower but works)

In [None]:
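import os
# Exercise 2 only: expose the OpenAI key like a system env var
if ENVIRONMENT == 'colab':
    from google.colab import userdata  # Colab-only module; fails on local Jupyter
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
# Locally, export OPENAI_API_KEY in your shell before starting Jupyter
print("Env var set:", bool(os.environ.get("OPENAI_API_KEY")))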

Env var set: True


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# ============================================
# CHOOSE YOUR MODEL
# ============================================
LLM_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # Or try "Qwen/Qwen2.5-3B-Instruct"
#model = "gpt-4o-mini"
print(f"Loading LLM: {LLM_MODEL}")
print(f"Device: {DEVICE}, Dtype: {DTYPE}")
print("This may take a few minutes on first run...\n")

tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL)

# Load with appropriate settings for each device type
if DEVICE == 'cuda':
    model = AutoModelForCausalLM.from_pretrained(
        LLM_MODEL,
        device_map="auto",
        torch_dtype=DTYPE,
        trust_remote_code=True
    )
    print("Model loaded on CUDA")

elif DEVICE == 'mps':
    # For MPS, load to CPU first, then move to MPS
    # (device_map="auto" doesn't work well with MPS)
    model = AutoModelForCausalLM.from_pretrained(
        LLM_MODEL,
        torch_dtype=DTYPE,
        trust_remote_code=True
    )
    model = model.to(DEVICE)
    print("Model loaded on MPS (Apple Silicon)")

else:
    # CPU
    model = AutoModelForCausalLM.from_pretrained(
        LLM_MODEL,
        torch_dtype=DTYPE,
        trust_remote_code=True
    )
    print("Model loaded on CPU (this will be slow)")

Loading LLM: Qwen/Qwen2.5-1.5B-Instruct
Device: cuda, Dtype: torch.float16
This may take a few minutes on first run...



`torch_dtype` is deprecated! Use `dtype` instead!


Model loaded on CUDA


In [None]:
def generate_response(prompt: str, max_new_tokens: int = 512, temperature: float = 0.3) -> str:
    """
    Generate a response from the LLM.

    Lower temperature = more focused/deterministic
    Higher temperature = more creative/random
    """
    inputs = tokenizer(prompt, return_tensors="pt")

    # Move inputs to the correct device
    if DEVICE == 'cuda':
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
    else:
        inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the new tokens
    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )

    return response.strip()
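
The effect of `temperature` described in the docstring above can be sketched with a toy softmax. This assumes the usual formulation in which logits are divided by the temperature before the softmax; the helper name and the logits below are made up for illustration and are not output from the model.

```python
import numpy as np


def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn logits into sampling probabilities after scaling by 1/temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()   # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()


toy_logits = np.array([2.0, 1.0, 0.5])  # hypothetical scores for three candidate tokens

for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {np.round(probs, 3)}")
```

With these toy numbers, the top-scoring token gets well over 90% of the probability mass at 0.3 and less than half at 2.0, so sampling becomes more varied as the temperature rises.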

---
## Stage 7: The Complete RAG Pipeline

Now we put it all together. The **prompt template** is critical - it must instruct the model to use the retrieved context.
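
As a rough sketch of what prompt assembly does, the snippet below fills a template with two toy chunks. The template text, the `ToyChunk` class, the `assemble_prompt` helper, and the sample passages are hypothetical stand-ins, not the template or data used in this notebook.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyChunk:
    """Hypothetical stand-in for a retrieved chunk."""
    text: str
    source_file: str


TOY_TEMPLATE = """CONTEXT:
{context}

QUESTION: {question}

Answer using ONLY the context above. If the answer is not there, say so.

ANSWER:"""


def assemble_prompt(
    question: str,
    retrieved: List[Tuple[ToyChunk, float]],
    template: str = TOY_TEMPLATE,
) -> str:
    """Join (chunk, score) pairs into one context string and fill the template."""
    parts = [
        f"[Source: {chunk.source_file}, Relevance: {score:.3f}]\n{chunk.text}"
        for chunk, score in retrieved
    ]
    context = "\n\n---\n\n".join(parts)
    return template.format(context=context, question=question)


retrieved = [
    (ToyChunk("Made-up passage about lubrication.", "toy_manual.txt"), 0.71),
    (ToyChunk("Made-up passage about spark plugs.", "toy_manual.txt"), 0.42),
]
print(assemble_prompt("What oil should I use?", retrieved))
```

Because the template is filled with `str.format`, any literal braces in a custom template need to be doubled (`{{` and `}}`).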

In [None]:
# The RAG prompt template
PROMPT_TEMPLATE = """You are a helpful assistant that answers questions based on the provided context.

CONTEXT:
{context}

QUESTION: {question}

INSTRUCTIONS:
- Answer the question based ONLY on the information in the context above
- If the context doesn't contain enough information to answer, say so
- Quote relevant parts of the context to support your answer
- Be concise and direct

ANSWER:"""


def direct_query(question: str, max_new_tokens: int = 512) -> str:
    """Ask the LLM directly with no retrieved context (for RAG vs no-RAG comparison)."""
    prompt = f"""Answer this question:
{question}

Answer:"""
    return generate_response(prompt, max_new_tokens=max_new_tokens)

def rag_query(question: str, top_k: int = 5, show_context: bool = False, prompt_template: str = None) -> str:
    """The complete RAG pipeline. prompt_template: custom template for Exercise 10."""
    # Step 1: Retrieve
    results = retrieve(question, top_k)

    # Format context
    context_parts = []
    for chunk, score in results:
        context_parts.append(f"[Source: {chunk.source_file}, Relevance: {score:.3f}]\n{chunk.text}")
    context = "\n\n---\n\n".join(context_parts)

    if show_context:
        print("=" * 60)
        print("RETRIEVED CONTEXT:")
        print("=" * 60)
        print(context)
        print("=" * 60 + "\n")

    # Step 2: Build prompt (use custom template if provided)
    template = prompt_template if prompt_template is not None else PROMPT_TEMPLATE
    prompt = template.format(context=context, question=question)

    # Step 3: Generate
    answer = generate_response(prompt)

    return answer
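
Since `rag_query` fills the template with `str.format`, a custom template for Exercise 10 only works if its placeholders are `{context}` and `{question}`: an extra placeholder raises `KeyError`, and a missing one is silently dropped. The sketch below checks this before any generation is run. `check_template` and `CITATION_TEMPLATE` are hypothetical names introduced here for illustration.

```python
from string import Formatter

REQUIRED_FIELDS = {"context", "question"}

def check_template(template: str) -> None:
    """Raise ValueError unless the template uses exactly {context} and {question}."""
    found = {field for _, field, _, _ in Formatter().parse(template) if field is not None}
    missing = REQUIRED_FIELDS - found
    extra = found - REQUIRED_FIELDS
    if missing or extra:
        raise ValueError(f"missing placeholders: {sorted(missing)}, unexpected: {sorted(extra)}")

# Hypothetical citation-style template for Exercise 10
CITATION_TEMPLATE = """Use only the context below and cite the source file for every claim.

CONTEXT:
{context}

QUESTION: {question}

ANSWER (with citations):"""

check_template(CITATION_TEMPLATE)  # raises nothing when the placeholders match
# answer = rag_query("your question", prompt_template=CITATION_TEMPLATE)
```

Literal braces in a template (for example a JSON output format in the structured variant) need to be doubled as `{{` and `}}`, otherwise `str.format` treats them as placeholders.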

In [None]:
# ============================================
# TEST YOUR RAG PIPELINE!
# ============================================

question = "What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?"  # ← Modify for your corpus!

if index.ntotal > 0:
    print(f"Question: {question}\n")
    print("Generating answer...\n")

    answer = rag_query(question, top_k=5, show_context=True)

    print("ANSWER:")
    print(answer)
else:
    print("Pipeline not ready - please complete all previous stages first.")

Question: What did Mr. Flood have to say about Mayor David Black in Congress on January 13, 2026?

Generating answer...

RETRIEVED CONTEXT:
[Source: ModelT-41-50.txt, Relevance: 0.139]
nd

---

[Source: Ford-Model-T-Man-1919.txt, Relevance: 0.134]
ully retarded the adjusting screw should be

screwed in un’ it strikes the bose, preventing the throttle from closing too far.

When proper adjustment has been made, tighten lock-screw so that adjustment
will not be disturbed.

What is the
the Hot dir Pipa? a Answer No. §0

It takes the hot air from around the exhaust pipe and conducts it to the catbure-

tor—where the heat facilitates the yaporiging:of the gasoline. It is usually advis- ‘

.-

te 8 Pel - uJ

‘
4
d
‘
&

t
é

---

[Source: ModelT-21-30.txt, Relevance: 0.134]
ully retarded the adjusting screw should be

screwed in un’ it strikes the bose, preventing the throttle from closing too far.

When proper adjustment has been made, tighten lock-screw so that adjustment
will
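
In the run above, every retrieved chunk comes from the Model T files and the best match scores about 0.14, compared with 0.66 or more for the top hit on the on-topic question in the Exercise 7 output later in this notebook. One hedged way to act on that gap is to drop low-scoring chunks before building the prompt. The sketch below assumes `retrieve` returns `(chunk, score)` pairs where higher is better, as the printed relevance values suggest; `filter_by_score` and the 0.3 cutoff are illustrative assumptions that would need tuning per corpus and embedding model.

```python
from typing import List, Tuple, TypeVar

ChunkT = TypeVar("ChunkT")

def filter_by_score(
    results: List[Tuple[ChunkT, float]], min_score: float = 0.3
) -> List[Tuple[ChunkT, float]]:
    """Keep only (chunk, score) pairs whose score reaches min_score."""
    return [(chunk, score) for chunk, score in results if score >= min_score]

# Stand-in results: plain strings replace the notebook's chunk objects
off_topic = [("nd", 0.139), ("ully retarded the adjusting screw", 0.134)]
on_topic = [("To tighten the brake and reverse bands", 0.664), ("getting the pedal shafts", 0.508)]

for label, results in (("off-topic", off_topic), ("on-topic", on_topic)):
    kept = filter_by_score(results)
    if kept:
        print(f"{label}: {len(kept)} chunk(s) kept for the prompt")
    else:
        print(f"{label}: nothing above the cutoff, so decline to answer")
```

Declining to answer when nothing clears the cutoff is one way to handle the off-topic cases in Exercise 5 without relying on the prompt alone.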

---
## Experiments: Understanding RAG Behavior

Now that you have a working pipeline, try these experiments to understand how each component affects the results.
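
Several of the experiments change chunk size and overlap, so it helps to see what overlap does mechanically. The following is a minimal sketch assuming a plain fixed-width character window; `sliding_window_chunks` is a hypothetical helper, and the notebook's own chunker may split on different boundaries and so produce different counts.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into fixed-width windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window start advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# Synthetic 10,000-character document, used purely for counting windows
sample_text = "x" * 10_000
for overlap in (0, 64, 128, 256):
    count = len(sliding_window_chunks(sample_text, chunk_size=512, overlap=overlap))
    print(f"overlap={overlap:>3}: {count} chunks")
```

A larger overlap shrinks the step between window starts, so the same text yields more chunks to embed and index, while a sentence that straddles a boundary is more likely to appear whole in at least one chunk.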

In [None]:
# Exercise 7 - OVERLAP EXPERIMENT

import time
import textwrap

# Experiment params
CHUNK_SIZE = 512
OVERLAPS = [0, 64, 128, 256]
TOP_K = 5

# Fixed question, held constant across all overlap settings
QUESTION = "How do I tighten the brake and reverse bands?"
summary_rows = []

print(f"\nQuestion: {QUESTION}\n")
for overlap in OVERLAPS:
    print("\n" + "="*72)
    print(f"Overlap: {overlap}")
    print("="*72)
    t0 = time.time()

    # Rebuild chunks + index for this overlap (updates global all_chunks and index)
    rebuild_pipeline(chunk_size=CHUNK_SIZE, chunk_overlap=overlap)

    elapsed_rechunk = time.time() - t0
    num_chunks = len(all_chunks)
    index_size = getattr(index, "ntotal", num_chunks)

    # Retrieve top-K for question
    retrieved = retrieve(QUESTION, top_k=TOP_K)
    retrieved = list(retrieved)[:TOP_K]

    # Print each hit as a source/score header followed by a 250-character excerpt
    for i, (chunk, score) in enumerate(retrieved, start=1):
        excerpt = chunk.text.replace("\n", " ").strip()
        if len(excerpt) > 250:
            excerpt = excerpt[:250].rstrip() + "..."
        print(f"[Source: {chunk.source_file}, Relevance: {score:.3f}]")
        print(textwrap.fill(excerpt, width=100))
        print("\n---\n")


    summary_rows.append({
        "overlap": overlap,
        "chunks": num_chunks,
        "index_size": index_size,
        "retrieved_count": len(retrieved),
        "rechunk_time_s": round(elapsed_rechunk, 1)
    })

# Print comparison table
print("\n" + "="*72)
print("COMPARISON TABLE")
print("="*72)
header = f"| {'Overlap':^7} | {'Chunks':^7} | {'Index Size':^10} | {'Retrieved':^9} | {'Rechunk(s)':^11} |"
print(header)
print("|" + "-"*(len(header)-2) + "|")
for r in summary_rows:
    print(f"| {r['overlap']:^7} | {r['chunks']:^7} | {r['index_size']:^10} | {r['retrieved_count']:^9} | {r['rechunk_time_s']:^11} |")
print("="*72)



Question: How do I tighten the brake and reverse bands?


Overlap: 0


Batches:   0%|          | 0/15 [00:00<?, ?it/s]

Rebuilt: 449 chunks, chunk_size=512, chunk_overlap=0
[Source: Ford-Model-T-Man-1919.txt, Relevance: 0.664]
The slow.speed band.may be tightened by loosening the lock nut at the tight side of the transmission
cover, and turning the adjusting screw (see Cut No. 12)'té tf  right. To tighten the brake and
reverse bands remove the transmission cover  doof and...

---

[Source: ModelT-31-40.txt, Relevance: 0.664]
The slow.speed band.may be tightened by loosening the lock nut at the tight side of the transmission
cover, and turning the adjusting screw (see Cut No. 12)'té tf  right. To tighten the brake and
reverse bands remove the transmission cover  doof and...

---

[Source: Ford-Model-T-Man-1919.txt, Relevance: 0.508]
getting the. pedal shafts to rest in the notches in the band ears. The clutch release ring must Be
placed in the rear groove of the clutch shift, With the cover in-place remove the cond which held
the bands in place while the cover was being installe...

---

[Source: Model

Batches:   0%|          | 0/17 [00:00<?, ?it/s]

Rebuilt: 524 chunks, chunk_size=512, chunk_overlap=64
[Source: Ford-Model-T-Man-1919.txt, Relevance: 0.774]
cover, and turning the adjusting screw (see Cut No. 12)'té tf  right. To tighten the brake and
reverse bands remove the transmission cover  doof and turn the adjusting nuts on the shafts to the
right. See that the bands do : not dragon the drums when...

---

[Source: ModelT-31-40.txt, Relevance: 0.774]
cover, and turning the adjusting screw (see Cut No. 12)'té tf  right. To tighten the brake and
reverse bands remove the transmission cover  doof and turn the adjusting nuts on the shafts to the
right. See that the bands do : not dragon the drums when...

---

[Source: Ford-Model-T-Man-1919.txt, Relevance: 0.592]
dis, rather than by turning the adjusting screws in too far. ;  CAUTION: Let us warn you against
placing any small! toala or objects over or'ip the transmission case without a good wire or cord
attached to them. Otherwise if they are dropped into the...

---

[Source: Mo

Batches:   0%|          | 0/21 [00:00<?, ?it/s]

Rebuilt: 654 chunks, chunk_size=512, chunk_overlap=128
[Source: Ford-Model-T-Man-1919.txt, Relevance: 0.778]
urning the adjusting screw (see Cut No. 12)'té tf  right. To tighten the brake and reverse bands
remove the transmission cover  doof and turn the adjusting nuts on the shafts to the right. See that
the bands do : not dragon the drums when disengaged,...

---

[Source: ModelT-31-40.txt, Relevance: 0.778]
urning the adjusting screw (see Cut No. 12)'té tf  right. To tighten the brake and reverse bands
remove the transmission cover  doof and turn the adjusting nuts on the shafts to the right. See that
the bands do : not dragon the drums when disengaged,...

---

[Source: Ford-Model-T-Man-1919.txt, Relevance: 0.620]
ise if they are dropped into the transmission case it is almost impossible to recover them  ‘ée
Bands adjusted? Answer No. TH  The slow.speed band.may be tightened by loosening the lock nut at the
tight side of the transmission cover, and turning th...

---

[Source:

Batches:   0%|          | 0/32 [00:00<?, ?it/s]

Rebuilt: 1017 chunks, chunk_size=512, chunk_overlap=256
[Source: Ford-Model-T-Man-1919.txt, Relevance: 0.766]
To tighten the brake and reverse bands remove the transmission cover  doof and turn the adjusting
nuts on the shafts to the right. See that the bands do : not dragon the drums when disengaged, as
they exert 2 brake effect, and tend to overheat the mo...

---

[Source: ModelT-31-40.txt, Relevance: 0.766]
To tighten the brake and reverse bands remove the transmission cover  doof and turn the adjusting
nuts on the shafts to the right. See that the bands do : not dragon the drums when disengaged, as
they exert 2 brake effect, and tend to overheat the mo...

---

[Source: Ford-Model-T-Man-1919.txt, Relevance: 0.647]
nsmission case it is almost impossible to recover them  ‘ée  Bands adjusted? Answer No. TH  The
slow.speed band.may be tightened by loosening the lock nut at the tight side of the transmission
cover, and turning the adjusting screw (see Cut No. 12)'t...

---

[Source:
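
In the output above, the first two hits at every overlap setting show identical excerpts and identical scores but different source files (`Ford-Model-T-Man-1919.txt` and `ModelT-31-40.txt`), which suggests the same pages appear in the corpus twice. Duplicates like this use up top-K slots without adding information. A minimal sketch of one possible remedy follows, assuming results are `(chunk, score)` pairs whose chunks expose `.text` and `.source_file` as in `rag_query`; the `Chunk` dataclass here is a stand-in and `dedupe_results` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Chunk:
    """Stand-in for the notebook's chunk objects (only the fields used here)."""
    text: str
    source_file: str

def dedupe_results(results: List[Tuple[Chunk, float]]) -> List[Tuple[Chunk, float]]:
    """Drop later results whose whitespace-normalized text was already seen."""
    seen = set()
    unique = []
    for chunk, score in results:
        key = " ".join(chunk.text.split())
        if key not in seen:
            seen.add(key)
            unique.append((chunk, score))
    return unique

# Illustrative data loosely modeled on the output above
results = [
    (Chunk("To tighten the brake and reverse bands remove the cover", "Ford-Model-T-Man-1919.txt"), 0.766),
    (Chunk("To tighten the brake and reverse bands remove the cover", "ModelT-31-40.txt"), 0.766),
    (Chunk("The slow speed band may be tightened", "Ford-Model-T-Man-1919.txt"), 0.647),
]
for chunk, score in dedupe_results(results):
    print(f"{score:.3f}  {chunk.source_file}")
```

To still fill all top-K slots, one option is to retrieve more candidates than needed and truncate after deduplication.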

---
## Save/Load Your Index

For large corpora, you don't want to re-embed every time. Here's how to persist the index.

In [None]:
import pickle

def save_index(filepath: str):
    """Save FAISS index and chunks to disk."""
    faiss.write_index(index, f"{filepath}.faiss")
    with open(f"{filepath}.chunks", 'wb') as f:
        pickle.dump(all_chunks, f)
    print(f"✓ Saved index to {filepath}.faiss")
    print(f"✓ Saved chunks to {filepath}.chunks")

def load_saved_index(filepath: str):
    """Load FAISS index and chunks from disk."""
    global index, all_chunks
    index = faiss.read_index(f"{filepath}.faiss")
    with open(f"{filepath}.chunks", 'rb') as f:
        all_chunks = pickle.load(f)
    print(f"✓ Loaded index with {index.ntotal} vectors")

# Save your index
if index.ntotal > 0:
    save_index("my_rag_index")
else:
    print("No index to save.")

# Later, to load:
# load_saved_index("my_rag_index")

✓ Saved index to my_rag_index.faiss
✓ Saved chunks to my_rag_index.chunks
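
`pickle.load` can execute arbitrary code embedded in the file, so a pickled chunk list should only be loaded if you created it yourself. If the index needs to be shared, one alternative is to store the chunks as JSON. The sketch below assumes each chunk is a dataclass with `text` and `source_file` fields (the two attributes used in `rag_query`); the real chunk class may carry more fields, and `save_chunks_json` and `load_chunks_json` are hypothetical helpers.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class Chunk:
    """Stand-in chunk type; the notebook's class may define more fields."""
    text: str
    source_file: str

def save_chunks_json(chunks: List[Chunk], filepath: str) -> None:
    """Write chunks as a JSON list of plain dictionaries."""
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump([asdict(c) for c in chunks], f, ensure_ascii=False)

def load_chunks_json(filepath: str) -> List[Chunk]:
    """Rebuild Chunk objects from a file written by save_chunks_json."""
    with open(filepath, "r", encoding="utf-8") as f:
        return [Chunk(**record) for record in json.load(f)]

# Round-trip a one-chunk example through a temporary directory
demo_chunks = [Chunk("To tighten the brake and reverse bands...", "Ford-Model-T-Man-1919.txt")]
demo_path = os.path.join(tempfile.mkdtemp(), "my_rag_index.chunks.json")
save_chunks_json(demo_chunks, demo_path)
print(load_chunks_json(demo_path) == demo_chunks)
```

After loading, comparing `index.ntotal` with `len(all_chunks)` is a cheap check that the index file and the chunk file belong together.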


---
## Next Steps

You've built a complete RAG pipeline from scratch! In the next class, we'll:

1. **Improve retrieval** with query rewriting and hybrid search
2. **Rebuild with LangChain** to see how frameworks abstract these steps
3. **Evaluate systematically** with test questions and metrics

### Exercises to try:
- Vary chunk size (256, 512, 1024) and measure retrieval quality
- Try a different embedding model (`BAAI/bge-small-en-v1.5`)
- Try a larger LLM (`Qwen/Qwen2.5-3B-Instruct`) and compare answer quality
- Ask questions that require combining information from multiple chunks
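
To "measure retrieval quality" in the chunk-size exercise you need a number to compare. One simple option is hit rate at K: the fraction of test questions for which at least one retrieved chunk contains an expected phrase. The sketch below is only a starting point: `hit_rate_at_k`, the test cases, and `toy_retrieve` are all hypothetical, and `retrieve_fn` stands in for the notebook's `retrieve`, adapted to return chunk texts.

```python
from typing import Callable, List, Tuple

def hit_rate_at_k(
    test_cases: List[Tuple[str, str]],
    retrieve_fn: Callable[[str, int], List[str]],
    k: int = 5,
) -> float:
    """Fraction of (question, expected_phrase) cases with the phrase in the top-k texts."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, expected_phrase in test_cases:
        texts = retrieve_fn(question, k)
        if any(expected_phrase.lower() in text.lower() for text in texts):
            hits += 1
    return hits / len(test_cases)

# Toy passages and a toy retriever that ranks by words shared with the query
PASSAGES = [
    "To tighten the brake and reverse bands remove the transmission cover door",
    "The hot air pipe conducts heat from the exhaust pipe to the carburetor",
]

def toy_retrieve(question: str, k: int) -> List[str]:
    words = set(question.lower().split())
    ranked = sorted(PASSAGES, key=lambda p: len(words & set(p.lower().split())), reverse=True)
    return ranked[:k]

cases = [
    ("How do I tighten the brake bands?", "transmission cover"),
    ("What does the hot air pipe do?", "carburetor"),
]
print(f"hit rate@1 = {hit_rate_at_k(cases, toy_retrieve, k=1):.2f}")
```

With the real pipeline, something like `lambda q, k: [chunk.text for chunk, _ in retrieve(q, k)]` could be passed as `retrieve_fn`, and the same test cases rerun after each rebuild with a different chunk size.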

---
## Appendix: Device Information

Run this cell to see detailed information about your compute environment.

In [None]:
def print_device_info():
    """Print detailed information about available compute devices."""
    print("=" * 60)
    print("DEVICE INFORMATION")
    print("=" * 60)

    print(f"\nEnvironment: {ENVIRONMENT}")
    print(f"PyTorch version: {torch.__version__}")

    # CUDA
    print(f"\nCUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"  Device: {torch.cuda.get_device_name(0)}")
        print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

    # MPS
    print(f"\nMPS available: {torch.backends.mps.is_available()}")
    print(f"MPS built: {torch.backends.mps.is_built()}")

    # Current selection
    print(f"\n→ Selected device: {DEVICE}")
    print(f"→ Selected dtype: {DTYPE}")
    print("=" * 60)

print_device_info()