# RAG with PDF using Gemini SDK

This notebook demonstrates how to implement a Retrieval-Augmented Generation (RAG) system with PDF documents using Google's Gemini SDK. We'll go through the following steps:

1. Import and set up the necessary libraries
2. Load and process a PDF document
3. Generate text embeddings
4. Create a vector store for efficient retrieval
5. Implement the RAG pipeline with Gemini
6. Query the system with questions about the PDF document

Let's get started!

## 1. Import and Set Up Libraries

Let's start by importing the necessary libraries:
- `google.genai`: The Gemini SDK (the `google-genai` package) for accessing Google's generative AI models
- `PyPDF2`: For parsing PDF files
- `langchain_text_splitters`: LangChain's text-splitting utilities, used here to break the document into chunks
- `faiss`: For vector storage and similarity search
- Other utility libraries (`numpy`, `python-dotenv`)

In [None]:
# Import necessary libraries
import os
from google import genai
import PyPDF2
import numpy as np
from typing import List, Dict, Any
from google.genai import types

# For vector storage and retrieval
import faiss

# For text processing
import re
from langchain_text_splitters import RecursiveCharacterTextSplitter

# For env vars (API keys)
from dotenv import load_dotenv
load_dotenv()  # Load environment variables from .env file

# Set up Google API key
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("Please set your GOOGLE_API_KEY in .env file or environment variables")

client = genai.Client(api_key=GOOGLE_API_KEY)
print("All libraries imported successfully!")

## 2. Load and Process PDF Document

Now we'll load and process the PDF file `sample_pdf.pdf`. We'll extract all the text from the document and then split it into smaller chunks for embedding and retrieval.

In [None]:
def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text content from a PDF file."""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ""
        for page in reader.pages:
            # Guard against pages with no extractable text (e.g. scanned images)
            text += (page.extract_text() or "") + "\n"
        return text

# Define the path to the PDF file
pdf_path = "../notebooks/sample_pdf.pdf"

# Extract text from the PDF
raw_text = extract_text_from_pdf(pdf_path)

# Display the first 500 characters of the extracted text
print(f"PDF Length: {len(raw_text)} characters")
print("\nPreview of the extracted text:")
print(raw_text[:500] + "...")

In [None]:
# Split the text into smaller chunks for processing
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
)

# Split the text into chunks
text_chunks = text_splitter.split_text(raw_text)

print(f"Split the document into {len(text_chunks)} chunks")
print(f"\nSample chunk (first chunk):\n{text_chunks[0]}")
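To make the role of `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking using plain fixed-size character windows. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not do. The helper name `sliding_window_chunks` and the demo text are hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # How far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reaches the end of the text
    return chunks


# Hypothetical demo input: the alphabet, split into windows of 10 with 3 shared characters
demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo_text, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each window repeats the last three characters of the previous one, so a sentence that straddles a chunk boundary still appears intact in at least one chunk.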

## 3. Generate Embeddings

We'll use Google's `text-embedding-004` embedding model to convert our text chunks into vector embeddings. These embeddings capture the semantic meaning of the text: chunks with similar meaning end up close together in vector space, which is what makes similarity search possible.
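As a rough illustration of what "similar" means for vectors, the sketch below computes cosine similarity in plain Python. The three-dimensional vectors are made up for the example; real embeddings have hundreds of dimensions. Note that the vector store later in this notebook ranks by L2 distance rather than cosine similarity, so this shows one common measure, not the one FAISS is configured with here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Similarity is undefined for a zero vector; treat it as unrelated
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real embeddings
toy_query = [1.0, 0.0, 1.0]
toy_related = [0.9, 0.1, 0.8]
toy_unrelated = [0.0, 1.0, 0.0]

print(cosine_similarity(toy_query, toy_related))    # Close to 1: similar direction
print(cosine_similarity(toy_query, toy_unrelated))  # 0.0: orthogonal vectors
```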

In [None]:

def get_embedding(text, model="text-embedding-004", task_type="RETRIEVAL_DOCUMENT", title=None):
    """
    Get embedding vector for text using Google's Embedding API
    
    Args:
        text (str): The text to embed
        model (str): The model to use
        task_type (str): The task type for the embedding
        title (str, optional): Optional title for the document
        
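    Returns:
        list | None: The embedding vector, or None if the request failed
    """
    try:
        # `types` is already imported from google.genai at the top of the notebook.
        # Note: `title` only applies to the RETRIEVAL_DOCUMENT task type.
        config = types.EmbedContentConfig(
            task_type=task_type,
            title=title
        )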
        
        result = client.models.embed_content(
            model=model,
            contents=text,
            config=config
        )
        
        return result.embeddings[0].values if result.embeddings else None
    except Exception as e:
        print(f"Error getting embedding: {str(e)}")
        return None

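# Generate embeddings for all text chunks
embeddings = []
for i, chunk in enumerate(text_chunks):
    print(f"Generating embedding for chunk {i+1}/{len(text_chunks)} ({(i+1)/len(text_chunks)*100:.1f}%)...", end="\r")
    embedding = get_embedding(chunk)
    embeddings.append(embedding)

# get_embedding returns None on failure; drop failed chunks so that
# text_chunks and the embedding matrix stay aligned row for row
kept = [(c, e) for c, e in zip(text_chunks, embeddings) if e is not None]
if not kept:
    raise RuntimeError("No embeddings were generated; check the API key and model name")
text_chunks = [c for c, _ in kept]
embeddings = [e for _, e in kept]
print(f"\nGenerated {len(embeddings)} embeddings")

# Convert to numpy array for further processing
embeddings_array = np.array(embeddings, dtype='float32')
print(f"Embedding dimensions: {embeddings_array.shape}")  # Should be (num_chunks, embedding_dim)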

## 4. Create Vector Store

Now we'll create a vector store using FAISS, a library for efficient similarity search. This will allow us to quickly find the most similar text chunks to a query.

In [None]:
# Create a FAISS index for vector search
embedding_dimension = len(embeddings[0])
index = faiss.IndexFlatL2(embedding_dimension)  # L2 distance for similarity search

# Add vectors to the index
index.add(embeddings_array)
print(f"Added {index.ntotal} vectors to the FAISS index")

def search_similar_chunks(query: str, top_k: int = 3) -> List[Dict[str, Any]]:
    """
    Search for chunks similar to the query and return them with their similarity scores.
    
    Args:
        query: The search query
        top_k: Number of results to return
        
    Returns:
        List of dictionaries with 'chunk', 'score', and 'id' keys
    """
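    # Get query embedding. Queries use the RETRIEVAL_QUERY task type,
    # while the indexed chunks were embedded as RETRIEVAL_DOCUMENT.
    query_embedding = get_embedding(query, task_type="RETRIEVAL_QUERY")
    if query_embedding is None:
        return []  # Embedding request failed; nothing to search with
    query_embedding_array = np.array([query_embedding]).astype('float32')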
    
    # Search in the index
    distances, indices = index.search(query_embedding_array, top_k)
    
    # Format results
    results = []
    for i, idx in enumerate(indices[0]):
        if idx != -1:  # Valid result
            results.append({
                'chunk': text_chunks[idx],
                'score': float(1 / (1 + distances[0][i])),  # Map the squared L2 distance to a (0, 1] similarity score
                'id': int(idx)
            })
    
    return results

# Test the search function
test_query = "What is this document about?"
similar_chunks = search_similar_chunks(test_query, top_k=2)

print("Test search results:")
for i, result in enumerate(similar_chunks):
    print(f"\nResult {i+1} (Score: {result['score']:.4f}):")
    print(f"Chunk {result['id']}: {result['chunk'][:100]}...")
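For intuition about what the flat index is doing, the sketch below performs the same kind of exhaustive nearest-neighbour search in plain Python: it compares the query against every stored vector, ranks by squared L2 distance, and converts each distance to a score with the same `1 / (1 + distance)` mapping used above. The function names and toy vectors are hypothetical, and this is not how FAISS is implemented internally, only what an exact L2 search computes.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Return (index, squared distance) for the top_k vectors closest to the query."""
    distances = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    distances.sort(key=lambda pair: pair[1])  # Smallest distance first
    return distances[:top_k]


# Made-up two-dimensional vectors standing in for chunk embeddings
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
toy_hits = brute_force_search(toy_vectors, query=[1.0, 2.0], top_k=2)

for idx, dist in toy_hits:
    score = 1 / (1 + dist)  # Distance 0 maps to 1.0; larger distances approach 0
    print(f"vector {idx}: distance={dist}, score={score:.4f}")
```

An exhaustive scan like this costs time proportional to the number of stored vectors per query, which is fine for a single PDF; larger collections are where approximate index types become worth considering.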

## 5. Implement RAG Pipeline

Now we'll implement the full RAG pipeline, which involves:
1. Taking a user query
2. Finding relevant text chunks from our document
3. Using these chunks as context for Gemini to generate an accurate answer

In [None]:

def generate_rag_response(query: str, top_k: int = 3) -> str:
    """
    Generate a response to a query using RAG with the Gemini model.
    
    Args:
        query: User query
        top_k: Number of similar chunks to retrieve
        
    Returns:
        Generated response from the model
    """
    # 1. Retrieve similar chunks
    similar_chunks = search_similar_chunks(query, top_k=top_k)
    context = "\n\n".join([result["chunk"] for result in similar_chunks])
    
    # 2. Create prompt with context
    prompt = f"""
    You are an AI assistant that answers questions based on provided context information. 
    Answer the following question using ONLY the context provided below. 
    If you can't answer the question based on the context, say "I don't have enough information to answer this question."
    
    CONTEXT:
    {context}
    
    QUESTION:
    {query}
    
    ANSWER:
    """
    
    # 3. Generate response with Gemini
    response = client.models.generate_content(
        model="gemini-2.0-flash",  # Using Gemini 2.0 Flash - a fast, efficient model
        contents=prompt,           # Your input prompt/question
        # Configuration parameters to control the generation
        config=types.GenerateContentConfig(
            temperature=0.8,          # Controls randomness: lower = more deterministic outputs
        ),
    )
    
    return response.text

# Test the RAG pipeline
test_question = "What is the main topic of this document?"
rag_answer = generate_rag_response(test_question)

print(f"Question: {test_question}")
print(f"\nAnswer: {rag_answer}")
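The prompt in `generate_rag_response` is written as an indented triple-quoted string, so the leading spaces on each line are sent to the model along with the text. A possible alternative is to assemble the prompt from unindented pieces, as in the sketch below. The helper `build_rag_prompt` and the `[n]` chunk labels are hypothetical additions for illustration; the instruction wording is taken from the prompt above.

```python
from typing import List

# Instruction text mirrors the wording used in generate_rag_response
INSTRUCTIONS = (
    "Answer the following question using ONLY the context provided below.\n"
    "If you can't answer the question based on the context, say "
    "\"I don't have enough information to answer this question.\""
)


def build_rag_prompt(query: str, chunks: List[str]) -> str:
    """Assemble a grounded prompt with no stray indentation; chunks are labelled [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return f"{INSTRUCTIONS}\n\nCONTEXT:\n{context}\n\nQUESTION:\n{query}\n\nANSWER:"


# Made-up chunks standing in for retrieved text
demo_prompt = build_rag_prompt(
    "What is FAISS?",
    ["FAISS is a similarity search library.", "Gemini is a model family."],
)
print(demo_prompt)
```

Numbering the chunks also makes it easier to inspect which retrieved passage an answer drew on when debugging retrieval quality.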

## 6. Query the System

Now that our RAG system is set up, let's ask it some questions about the PDF document. You can try different questions to see how the system responds.

In [None]:
# Function for interactive questioning
def ask_document(question: str) -> None:
    """
    Ask a question about the document and get a response using our RAG pipeline.
    
    Args:
        question: The question to ask
    """
    print("\n🔍 Searching for relevant context...")
    answer = generate_rag_response(question, top_k=3)
    print(f"\n🤖 Answer: {answer}")

# Try some sample questions
questions = [
    "What is the main topic of this document?",
    "What are the key points made in this document?",
    "Can you summarize the document for me?"
]

for question in questions:
    print(f"\n\n>>> Question: {question}")
    ask_document(question)

In [None]:
# Interactive cell for user questions
user_question = "What is the significance of this document?" # Change this to your question

ask_document(user_question)

## 7. Conclusion

In this notebook, we've demonstrated how to implement a simple RAG system with PDF documents using the Gemini SDK:

1. We loaded and extracted text from a PDF document
2. We split the text into manageable chunks
3. We generated embeddings for each chunk using Google's embedding model
4. We created a FAISS index for efficient similarity search
5. We set up a RAG pipeline that retrieves relevant context and generates answers with Gemini
6. We tested the system with various questions about the document

This approach allows for more accurate and grounded responses from the LLM by providing it with relevant context from our document. The system can be extended to handle multiple documents, different file formats, and more sophisticated retrieval methods.