<a href="https://colab.research.google.com/github/KartikayBhardwaj-dev/Deep_learning_college/blob/main/RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Building Retrieval-Augmented Generation (RAG) Systems: A Comprehensive Guide**

**Course:** Deep Learning and Generative AI  
**Institution:** IIT Madras  
**Author:** Prof. Balaji Srinivasan  
**Date:** November 2025  

---

## **Learning Objectives**

By the end of this tutorial, students will be able to:

1. **Understand** the fundamental architecture and components of RAG systems
2. **Implement** document processing pipelines for extracting text from PDFs
3. **Build** vector databases using embeddings for semantic search
4. **Create** retrieval systems that find relevant context for user queries
5. **Integrate** large language models with retrieval systems
6. **Develop** complete question-answering systems using LangChain
7. **Apply** best practices for chunking, embedding, and prompt engineering

---

## **Prerequisites**

- Strong understanding of Python programming and object-oriented concepts
- Familiarity with natural language processing fundamentals
- Knowledge of embeddings and vector similarity concepts
- Understanding of prompt engineering for large language models
- Basic experience with APIs and environment variables
- Familiarity with text processing and regular expressions

---

## **1. Overview of RAG Systems**

### **1.1 Introduction to Retrieval-Augmented Generation**

Retrieval-Augmented Generation (RAG) combines the power of large language models with external knowledge retrieval to produce accurate, contextually relevant answers. Unlike pure LLMs that rely solely on their training data, RAG systems:

- **Retrieve** relevant documents from external knowledge bases
- **Augment** prompts with retrieved context
- **Generate** responses grounded in specific source material
- **Reduce hallucinations** by constraining answers to provided context
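
The retrieve, augment, and generate steps listed above can be sketched without any external services. The following is a minimal illustration only: the three-sentence knowledge base is made up, retrieval uses simple word overlap instead of embeddings, and `generate` is a hypothetical stand-in for a real LLM call.

```python
import re

# Tiny in-memory "knowledge base" (made-up sentences, for illustration only)
KNOWLEDGE_BASE = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Chunk overlap repeats text across neighbouring chunks.",
    "Temperature controls how random a language model's sampling is.",
]

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, k: int = 1) -> list[str]:
    """Retrieve: rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]

def augment(query: str, passages: list[str]) -> str:
    """Augment: place the retrieved passages in the prompt ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Generate: hypothetical stand-in for the LLM call; it only reports the prompt size."""
    return f"[an LLM would answer here using a {len(prompt)}-character prompt]"

question = "What does FAISS do?"
top_passages = retrieve(question)
prompt = augment(question, top_passages)
print(prompt)
print(generate(prompt))
```

The rest of this tutorial replaces each stand-in with a real component: embeddings and FAISS for retrieval, a prompt template for augmentation, and Gemini for generation.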

### **1.2 Key Components of RAG Architecture**

| **Component** | **Purpose** | **Implementation** |
|---------------|-------------|-------------------|
| **Document Loader** | Extract text from various sources | PDF readers, web scrapers, database connectors |
| **Text Splitter** | Break documents into manageable chunks | Recursive character splitting with overlap |
| **Embedding Model** | Convert text to vector representations | Google text-embedding-004 |
| **Vector Store** | Enable similarity search over embeddings | FAISS (Facebook AI Similarity Search) |
| **Retriever** | Find relevant context for queries | Top-k similarity search |
| **Language Model** | Generate natural language responses | Gemini 2.5 Pro via Google AI API |
| **Prompt Template** | Structure queries with context | LangChain prompt templates |


## **2. Environment Setup and Dependencies**

- **pypdf**: For extracting text from PDF documents
- **langchain**: Framework for building LLM applications
- **langchain-google-genai**: Google Gemini integrations for LangChain
- **google-genai**: Google Generative AI Python SDK
- **langchain-community**: Community-maintained integrations
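- **python-dotenv**: For managing API keys and environment variables
- **gdown**: For downloading the sample PDF from Google Drive
- **faiss-cpu**: CPU build of the FAISS similarity search library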

In [None]:
!pip install pypdf langchain langchain-google-genai langchain-community python-dotenv google-genai gdown faiss-cpu


In [None]:
# Core Python libraries
import os
import re
from typing import List
import gdown

# PDF processing - we'll use pypdf instead of fitz
from pypdf import PdfReader

# LangChain components for our RAG system
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Environment management
from dotenv import load_dotenv

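# Load your API keys
load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
# The LangChain Google integrations look for GOOGLE_API_KEY by default,
# so mirror the key there if only GEMINI_API_KEY is set
if GEMINI_API_KEY and not os.getenv("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = GEMINI_API_KEY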

In [None]:
from google.colab import auth
auth.authenticate_user()

## **3. Document Processing Pipeline**

### **3.1 Loading PDF Documents**

The first step in building a RAG system is loading documents from external sources. We use pypdf to:
- Read PDF files page by page
- Extract raw text content
- Handle various PDF encodings and formats

In [None]:
PDF_URL = 'https://drive.google.com/uc?id=1w8ZHRrG5g0lGACvdlBL11bR1DaPldWRA'
PDF_PATH = 'hands_on_ml.pdf'

gdown.download(PDF_URL, PDF_PATH, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1w8ZHRrG5g0lGACvdlBL11bR1DaPldWRA
To: /content/hands_on_ml.pdf
100%|██████████| 58.5M/58.5M [00:00<00:00, 75.8MB/s]


'hands_on_ml.pdf'

In [None]:
# Read the PDF and extract all text
pdf_reader = PdfReader(PDF_PATH)
print(f"PDF loaded with {len(pdf_reader.pages)} pages")

# Extract text from all pages
raw_text = ""
for page in pdf_reader.pages:
    # Guard against pages that yield no text (e.g. image-only pages)
    page_text = page.extract_text() or ""
    raw_text += page_text

print(f"Extracted {len(raw_text)} characters total")

PDF loaded with 851 pages
Extracted 1704608 characters total


### **3.2 Text Cleaning and Preprocessing**

Raw text extracted from PDFs often contains:
- Excessive whitespace from formatting
- Control characters from encoding issues
- Irregular line breaks and spacing

Our cleaning function:
- Normalizes whitespace to single spaces
- Removes non-printable control characters
- Preserves semantic content while improving readability

In [None]:
# Clean the extracted text
def clean_extracted_text(text: str) -> str:
    # Replace multiple whitespace with single spaces
    cleaned = re.sub(r'\s+', ' ', text)
    # Remove control characters
    cleaned = re.sub(r'[\x00-\x1F\x7F]', '', cleaned)
    # Strip leading/trailing whitespace
    return cleaned.strip()

document_text = clean_extracted_text(raw_text)
print(f"Cleaned text: {len(document_text)} characters")
print(f"Preview: {document_text[:200]}...")

Cleaned text: 1659247 characters
Preview: Aurélien Géron Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow Concepts, Tools, and Techniques to Build Intelligent Systems TM 2nd Edition Updated for TensorFlow 2 Aurélien Géron Hands...


### **3.3 Text Chunking Strategy**

Breaking documents into chunks is crucial for RAG systems because:
- **LLM Context Limits**: Models have maximum token limits
- **Retrieval Precision**: Smaller chunks provide more targeted context
- **Semantic Coherence**: Proper chunking preserves meaning

**Key Parameters**:
- **chunk_size**: Maximum characters per chunk (1000)
- **chunk_overlap**: Overlapping characters between chunks (200)
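- **Separators**: Hierarchical splitting points (paragraphs → sentences → words). Because the cleaning step in Section 3.2 collapses every newline into a space, the `"\n\n"` and `"\n"` separators never match in this notebook, so splits fall on sentence and word boundaries.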

The overlap ensures context isn't lost at chunk boundaries.

In [None]:
chunk_size = 1000
chunk_overlap = 200

# Set up our text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

print("Creating text chunks...")
text_chunks = text_splitter.split_text(document_text)
print(f"Created {len(text_chunks)} chunks")

Creating text chunks...
Created 2151 chunks
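
To make the role of `chunk_overlap` concrete, here is a simplified fixed-width splitter. It is only a sketch: `RecursiveCharacterTextSplitter` splits on separators and merges the pieces, which this hypothetical `sliding_window_chunks` helper does not attempt. It only shows how consecutive chunks share their boundary text.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks where neighbours share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)

# Each chunk begins with the last 4 characters of the previous one
for previous, current in zip(chunks, chunks[1:]):
    print(previous[-4:], "==", current[:4])
```

A sentence that straddles a boundary therefore appears in full in at least one chunk, provided it is shorter than the overlap.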


### **3.4 Analyzing Chunk Statistics**

Understanding your chunks helps optimize retrieval performance:
- **Total chunks**: Indicates how many retrievable units exist
- **Average size**: Helps verify chunking strategy effectiveness
- **Size distribution**: Identifies potential issues with splitting

Previewing chunks confirms the content is properly segmented.

In [None]:
# Show info about our chunks
total_chars = sum(len(chunk) for chunk in text_chunks)
avg_chunk_size = total_chars / len(text_chunks) if text_chunks else 0

print(f"Chunk Statistics:")
print(f"   Total chunks: {len(text_chunks)}")
print(f"   Average size: {avg_chunk_size:.0f} characters")
print(f"   Chunk size range: {chunk_size} characters max")
print(f"   Overlap: {chunk_overlap} characters")

# Preview the first chunk
print(f"\nFirst chunk preview:")
print(f"{text_chunks[0][:300]}...")

Chunk Statistics:
   Total chunks: 2151
   Average size: 869 characters
   Chunk size range: 1000 characters max
   Overlap: 200 characters

First chunk preview:
Aurélien Géron Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow Concepts, Tools, and Techniques to Build Intelligent Systems TM 2nd Edition Updated for TensorFlow 2 Aurélien Géron Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow Concepts, Tools, and Techniques to Bui...
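
The statistics cell reports the total and the average, but not the size distribution mentioned above. One possible way to inspect it is sketched below. The helper name `summarize_chunk_sizes` and the 100-character `small_threshold` are assumptions made for illustration, and the demo list stands in for `text_chunks`.

```python
import statistics

def summarize_chunk_sizes(chunks: list[str], small_threshold: int = 100) -> dict:
    """Return basic distribution statistics for chunk lengths (in characters)."""
    sizes = [len(chunk) for chunk in chunks]
    if not sizes:
        return {"count": 0}
    return {
        "count": len(sizes),
        "min": min(sizes),
        "median": statistics.median(sizes),
        "max": max(sizes),
        # Very short chunks often carry too little context to be useful
        "num_small": sum(1 for size in sizes if size < small_threshold),
    }

# Stand-in data; in the notebook you would pass `text_chunks` instead
demo_chunks = ["a" * 950, "b" * 1000, "c" * 40, "d" * 870]
summary = summarize_chunk_sizes(demo_chunks)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A large number of very small chunks, or a maximum above `chunk_size`, would suggest the separators are not matching the text as intended.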


## **4. Building the Vector Database**

### **4.1 Setting Up the Embedding Model**

Embeddings convert text into dense vector representations that capture semantic meaning. We use Google's `text-embedding-004` model:

- **Dimensionality**: 768 dimensions
- **Performance**: State-of-the-art quality and speed
- **Cost-effective**: Free tier available with generous quotas
- **Semantic understanding**: Captures deep contextual relationships
- **Multilingual**: Supports 100+ languages

The embedding model is the foundation of similarity search.

In [None]:
# Set up the embedding model
print("Setting up embeddings model...")
embeddings_model = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",  # Google text-embedding-004 (768-dimensional vectors)
    task_type="retrieval_document"
)
print("Embeddings model ready!")

Setting up embeddings model...
Embeddings model ready!


### **4.2 Creating Document Objects**

LangChain's Document objects structure our data for the RAG pipeline:

**Components**:
- **page_content**: The actual text content of the chunk
- **metadata**: Additional information for tracking and filtering
  - chunk_id: Unique identifier for each chunk
  - chunk_length: Character count for analysis
  - source: Origin of the document

Metadata enables sophisticated retrieval strategies and source attribution.

In [None]:
# Convert chunks to LangChain documents
print("Converting chunks to documents...")
documents = []

for i, chunk in enumerate(text_chunks):
    doc = Document(
        page_content=chunk,
        metadata={
            "chunk_id": i,
            "chunk_length": len(chunk),
            "source": "pdf_document"
        }
    )
    documents.append(doc)

print(f"Created {len(documents)} document objects")

# Show a sample document
sample_doc = documents[0]
print(f"\nSample document:")
print(f"   Content length: {len(sample_doc.page_content)}")
print(f"   Metadata: {sample_doc.metadata}")
print(f"   Preview: {sample_doc.page_content[:150]}...")

Converting chunks to documents...
Created 2151 document objects

Sample document:
   Content length: 956
   Metadata: {'chunk_id': 0, 'chunk_length': 956, 'source': 'pdf_document'}
   Preview: Aurélien Géron Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow Concepts, Tools, and Techniques to Build Intelligent Systems TM 2nd Edi...
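
As an illustration of source attribution, the sketch below labels each chunk with its metadata before it is joined into a context string. `SimpleDoc` is a hypothetical stand-in that mirrors only the `page_content` and `metadata` fields of a LangChain `Document`, and `format_docs_with_sources` is an assumed helper name, not part of LangChain.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDoc:
    """Hypothetical stand-in mirroring the two fields of a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs_with_sources(docs) -> str:
    """Join chunks into one context string, prefixing each with its source and chunk id."""
    blocks = []
    for doc in docs:
        chunk_id = doc.metadata.get("chunk_id", "unknown")
        source = doc.metadata.get("source", "unknown")
        blocks.append(f"[source={source}, chunk={chunk_id}]\n{doc.page_content}")
    return "\n\n".join(blocks)

docs = [
    SimpleDoc("First chunk text.", {"chunk_id": 0, "source": "pdf_document"}),
    SimpleDoc("Second chunk text.", {"chunk_id": 7}),  # no source recorded
]
formatted = format_docs_with_sources(docs)
print(formatted)
```

Because the function only reads `page_content` and `metadata`, it should also accept the `Document` objects created above. With labels in the context, the model has something concrete to cite.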


In [None]:
print("Cleaning documents before embedding...")

# Re-encode each chunk as UTF-8, replacing characters that cannot be encoded
# (such as lone surrogates left over from PDF extraction) with '?'
for doc in documents:
    doc.page_content = doc.page_content.encode('utf-8', 'replace').decode('utf-8')

print("Cleaning complete!")

Cleaning documents before embedding...
Cleaning complete!


### **4.3 Building the FAISS Vector Store**

FAISS (Facebook AI Similarity Search) provides efficient similarity search:

**Process**:
1. **Embed documents**: Convert each chunk to a vector using the embedding model
2. **Index vectors**: Build an efficient search index structure
3. **Enable retrieval**: Allow fast k-nearest neighbor queries

**Performance**: FAISS can handle millions of vectors with sub-second query times.

**Note**: This step sends every chunk to the Google Gemini embedding API and may take several minutes depending on document size and API rate limits.

In [None]:
# Create the vector database
print("Building searchable vector database...")
print("This might take a few minutes...")

vector_store = FAISS.from_documents(
    documents=documents,      # These are now the cleaned documents
    embedding=embeddings_model
)

print("Vector database created successfully!")
print(f"Indexed {len(documents)} document chunks")

Building searchable vector database...
This might take a few minutes...
Vector database created successfully!
Indexed 2151 document chunks


### **4.4 Testing Vector Search**

Before building the complete RAG chain, verify the retrieval system:

**Similarity Search Process**:
1. Embed the query using the same embedding model
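2. Compare the query vector with all document vectors (LangChain's FAISS wrapper uses L2/Euclidean distance by default, which ranks unit-length vectors in the same order as cosine similarity)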
3. Return top-k most similar chunks

In [None]:
# Test the search functionality
def test_vector_search(query: str, num_results: int = 3):
    print(f"🔍 Searching for: '{query}'")

    # Perform similarity search
    search_results = vector_store.similarity_search(
        query=query,
        k=num_results
    )

    print(f"📋 Found {len(search_results)} relevant chunks:")

    for i, doc in enumerate(search_results, 1):
        print(f"\n📄 Result {i}:")
        print(f"   Chunk ID: {doc.metadata.get('chunk_id', 'unknown')}")
        print(f"   Preview: {doc.page_content[:200]}...")

    return search_results

# Test with a sample question
test_query = "What is machine learning?"
search_results = test_vector_search(test_query)
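
The similarity search process described above can be sketched as a brute-force top-k search in plain Python. The three-dimensional vectors are made up for illustration (real embeddings here have 768 dimensions), and FAISS uses optimized index structures instead of this loop. The example also shows that, for unit-length vectors, ranking by L2 distance and ranking by cosine similarity give the same order.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" for three chunks and one query
doc_vectors = {
    "chunk_0": normalize([1.0, 0.0, 0.0]),
    "chunk_1": normalize([0.8, 0.6, 0.0]),
    "chunk_2": normalize([0.0, 0.0, 1.0]),
}
query_vector = normalize([0.9, 0.1, 0.0])

def top_k_by_l2(query, vectors, k):
    """Return the k chunk names with the smallest L2 distance to the query."""
    return sorted(vectors, key=lambda name: l2_distance(query, vectors[name]))[:k]

def top_k_by_cosine(query, vectors, k):
    """Return the k chunk names with the largest cosine similarity to the query."""
    return sorted(vectors, key=lambda name: -cosine_similarity(query, vectors[name]))[:k]

l2_ranking = top_k_by_l2(query_vector, doc_vectors, k=2)
cosine_ranking = top_k_by_cosine(query_vector, doc_vectors, k=2)
print("Top-2 by L2 distance:      ", l2_ranking)
print("Top-2 by cosine similarity:", cosine_ranking)
```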

## **5. Integrating the Language Model**

### **5.1 Configuring the LLM**

We use Google's Gemini 2.5 Pro as our generation model:

**Configuration**:
- **model**: "gemini-2.5-pro" - Highly capable model with extended context
- **temperature**: 0.3 - Low value for consistent, factual responses (lower values reduce randomness in sampling)

In [None]:
# Set up the language model
print("Setting up AI language model...")
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-pro",
    temperature=0.3  # Low temperature for consistent, factual answers
)
print("Language model ready!")

Setting up AI language model...
Language model ready!
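
To build intuition for the `temperature` setting, the sketch below applies the standard textbook temperature-scaled softmax to a few made-up logits. This is an assumption about how temperature is commonly defined, not a description of Gemini's internal sampling, and it does not model a temperature of exactly 0 (usually treated as always picking the most likely token).

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

# Made-up scores for three candidate next tokens
logits = [2.0, 1.0, 0.5]

for temperature in (1.0, 0.3):
    probabilities = softmax_with_temperature(logits, temperature)
    formatted = ", ".join(f"{p:.3f}" for p in probabilities)
    print(f"temperature={temperature}: [{formatted}]")
```

Lowering the temperature concentrates probability on the highest-scoring token, which is why low values tend to give more repeatable answers.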


### **5.2 Designing the Prompt Template**

The prompt is crucial for RAG system performance. Our template:

**Instructions to the Model**:
1. **Only use provided context** - Reduces the risk of hallucination
2. **Admit limitations** - Be honest when context is insufficient
3. **Cite sources** - Reference relevant parts of context
4. **Stay concise** - Avoid unnecessary elaboration
5. **Don't guess** - Better to say "I don't know"

**Structure**:
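- Role instructions (sent as part of a single human message, since the template is built with `ChatPromptTemplate.from_template`)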
- Context placeholder (filled with retrieved chunks)
- User question
- Response directive

This design encourages answers that are grounded in the source material, although it cannot guarantee them.

In [None]:
# Create the prompt template using LCEL
system_prompt = """
You are a helpful AI assistant that answers questions based on the provided context.

Rules:
1. Only use information from the provided context to answer questions
2. If the context doesn't contain enough information, say so honestly
3. Be specific and cite relevant parts of the context
4. Keep your answers clear and concise
5. If you're unsure, admit it rather than guessing

Context:
{context}

Question: {input}

Answer based on the context above:
"""

prompt_template = ChatPromptTemplate.from_template(system_prompt)
print("Prompt template created!")

Prompt template created!


## **6. Building the Complete RAG Chain**

### **6.1 LangChain Expression Language (LCEL)**

We use LCEL to build a composable RAG pipeline:

**Pipeline Components**:
1. **Retriever**: Finds top-k relevant chunks (`search_kwargs={"k": 4}`)
2. **format_docs**: Combines retrieved chunks with double newlines
3. **Prompt Template**: Structures context + question
4. **LLM**: Generates response based on prompt
5. **Output Parser**: Extracts string from LLM response

**LCEL Syntax**:
- `|` operator chains components
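- A dict literal `{...}` is coerced into a `RunnableParallel`, which feeds the same input to every branch and collects the results under the dict's keys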
- `RunnablePassthrough()` forwards input unchanged

This creates an end-to-end system: **Question → Retrieval → Context → LLM → Answer**

In [None]:
# LCEL components (already imported in Section 2; repeated so this cell runs on its own)
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Create retriever from our vector store
retriever = vector_store.as_retriever(
    search_kwargs={"k": 4}  # Retrieve top 4 most relevant chunks
)

# Define a function to format retrieved documents
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Build the RAG chain using LCEL pipe syntax
rag_chain = (
    {
        "context": retriever | format_docs,  # Retrieve docs and format them
        "input": RunnablePassthrough()       # Pass the question through
    }
    | prompt_template                        # Format the prompt with context and question
    | llm                                   # Send to language model
    | StrOutputParser()                     # Parse the output to a string
)

print("Complete RAG system ready!")
print("You can now ask questions about your document!")

Complete RAG system ready!
You can now ask questions about your document!
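
To show how the `|` and `{}` syntax can compose a pipeline, here is a toy re-creation using Python operator overloading. It is not LangChain's implementation: `Step`, `as_step`, and the fake retriever, prompt, and LLM are all hypothetical stand-ins, and the dict branches run one after another, not in parallel.

```python
class Step:
    """Toy runnable: wraps a function and supports `|` chaining."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # step | other: feed this step's output into the next step
        next_step = as_step(other)
        return Step(lambda value: next_step.invoke(self.invoke(value)))

    def __ror__(self, other):
        # other | step: used when the left operand is a dict or a plain function
        return as_step(other) | self

def as_step(obj):
    """Coerce a Step, dict of branches, or plain callable into a Step."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        branches = {key: as_step(branch) for key, branch in obj.items()}
        # Every branch receives the same input; results are keyed by branch name
        return Step(lambda value: {key: b.invoke(value) for key, b in branches.items()})
    if callable(obj):
        return Step(obj)
    raise TypeError(f"Cannot turn {type(obj).__name__} into a Step")

# Stand-ins for the real components
passthrough = Step(lambda value: value)
fake_retriever = Step(lambda query: [f"chunk about {query}", "another chunk"])
fake_prompt = Step(lambda d: f"Context:\n{d['context']}\n\nQuestion: {d['input']}")
fake_llm = Step(lambda prompt: f"ANSWER({len(prompt)} chars of prompt)")

def format_docs(docs):
    return "\n\n".join(docs)

prompt_chain = {"context": fake_retriever | format_docs, "input": passthrough} | fake_prompt
chain = prompt_chain | fake_llm

print(prompt_chain.invoke("overlap"))
print(chain.invoke("overlap"))
```

The shape mirrors `rag_chain`: the question fans out to the retriever branch and the passthrough branch, and the resulting dict fills the prompt before it reaches the model.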


## **7. Using the RAG System**

### **7.1 Interactive Question-Answering**

In [None]:
def ask_document_question(question: str):
    print(f"Question: {question}")
    print("Thinking...")

    # Get the answer from our RAG system
    # With LCEL, we pass the question directly as a string
    response = rag_chain.invoke(question)

    # Display the answer
    print(f"\nAnswer:")
    print(f"{response}")

    # To see source documents, we need to get them separately
    source_docs = retriever.invoke(question)
    print(f"\nBased on {len(source_docs)} source chunks:")

    for i, doc in enumerate(source_docs, 1):
        chunk_id = doc.metadata.get('chunk_id', 'unknown')
        print(f"\nSource {i} (Chunk {chunk_id}):")
        print(f"   {doc.page_content[:200]}...")

    print("\n" + "="*80)
    return response

# Test with some questions
questions = [
    "What is Machine Learning?"
]

for question in questions:
    answer = ask_document_question(question)
    print()  # Add some space between questions

Question: What is Machine Learning?
Thinking...

Answer:
Based on the context provided, Machine Learning is defined in a few ways:

*   It is "the science (and art) of programming computers so they can learn from data."
*   A more general definition from Arthur Samuel (1959) describes it as the "field of study that gives computers the ability to learn without being explicitly programmed."
*   A more engineering-oriented definition from Tom Mitchell (1997) states: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."

Based on 4 source chunks:

Source 1 (Chunk 4):
   . End-to-End Machine Learning Project. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ....

Source 2 (Chunk 2):
   . The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used g