# RAG System with Gemini Flash

This notebook implements a Retrieval-Augmented Generation (RAG) system using:
- Sermon transcripts from the dataset as knowledge base
- Vector embeddings for semantic search
- Google Gemini Flash for response generation

Based on the structure of test.ipynb.

In [9]:
# Import required libraries
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import os

# LangChain imports
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores.faiss import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load environment variables
load_dotenv()

# Verify that the Google API key is available
if not os.environ.get("GOOGLE_API_KEY"):
    raise ValueError("GOOGLE_API_KEY is not set in the .env file.")

print("✅ Libraries imported and API key verified")

✅ Libraries imported and API key verified


In [2]:
# Load the sermon dataset
df = pd.read_csv('dataset/sermons_zac.csv')

# Clean the data (similar to test.ipynb)
df = df.dropna(subset=['sermon'])
df['sermon'] = df['sermon'].apply(lambda x: x[6:] if x.lower().startswith('music ') else x)
if 'Unnamed: 0' in df.columns:
    df.drop(columns=['Unnamed: 0'], inplace=True)
df.reset_index(drop=True, inplace=True)

print(f"📊 Dataset loaded: {len(df)} sermons")
print(f"📝 Total words: {df['sermon'].str.split().str.len().sum():,}")
print("\nDataset columns:", df.columns.tolist())
print("\nSample titles:")
for i, title in enumerate(df['title'].head(3)):
    print(f"{i+1}. {title}")

📊 Dataset loaded: 417 sermons
📝 Total words: 3,787,703

Dataset columns: ['author', 'video_id', 'title', 'sermon']

Sample titles:
1. Be Content with the Way God Made You — Zac Poonen — May 07, 2025
2. What Genuine Reverence for God Produces — Zac Poonen — May 04, 2025
3. The Presence of the Lord When We Break Bread — Zac Poonen — May 03, 2025
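
The cleaning step above drops a leading "music " token from a transcript (presumably a captioning artifact; the notebook does not say). A minimal standalone sketch of that rule, where `strip_leading_token` is a hypothetical helper name rather than part of the notebook:

```python
def strip_leading_token(text: str, token: str = "music ") -> str:
    """Drop `token` from the start of `text`, matching case-insensitively."""
    if text.lower().startswith(token):
        return text[len(token):]
    return text


# The trailing space in the token keeps words such as "Musical" intact.
print(strip_leading_token("Music Let us turn to Hebrews 13"))
print(strip_leading_token("Musical instruments are mentioned in the Psalms"))
```

Using `len(token)` instead of a hard-coded `6` keeps the slice correct if the token ever changes.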


In [3]:
# Convert dataframe to LangChain documents
documents = []
for index, row in df.iterrows():
    doc = Document(
        page_content=row['sermon'],
        metadata={
            "title": row['title'],
            "author": row['author'],
            "video_id": row['video_id'],
            "doc_id": index
        }
    )
    documents.append(doc)

print(f"📄 Created {len(documents)} documents")
print(f"📏 Average document length: {np.mean([len(doc.page_content) for doc in documents]):.0f} characters")

📄 Created 417 documents
📏 Average document length: 45677 characters


In [4]:
# Split documents into chunks for better retrieval
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split the documents
split_docs = text_splitter.split_documents(documents)

print(f"🔪 Split into {len(split_docs)} chunks")
print(f"📏 Average chunk length: {np.mean([len(doc.page_content) for doc in split_docs]):.0f} characters")

🔪 Split into 63420 chunks
📏 Average chunk length: 496 characters
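
With `chunk_size=500` and `chunk_overlap=200`, consecutive chunks share up to 200 characters, so each new chunk advances by roughly 300 characters. The sketch below illustrates that with a plain fixed-stride sliding window. This is a simplification: `RecursiveCharacterTextSplitter` also prefers to break on the listed separators, so its boundaries and counts will differ. `sliding_window_chunks` is a hypothetical helper, not a LangChain function.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 200) -> List[str]:
    """Cut `text` into windows of `chunk_size` that overlap by `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += stride
    return chunks


sample = "".join(str(n % 10) for n in range(1100))  # 1,100 characters of dummy text
chunks = sliding_window_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # neighbouring chunks share 200 characters
```

As a rough sanity check on the numbers above: 417 documents averaging 45,677 characters is about 19.0 million characters, and dividing by the 300-character stride gives roughly 63,500 chunks, in line with the 63,420 reported.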


In [5]:
# Initialize Gemini embeddings
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key=os.environ.get("GOOGLE_API_KEY")
)

print("🔧 Gemini embeddings initialized")
print("⏳ Creating vector store... (this may take a few minutes)")

🔧 Gemini embeddings initialized
⏳ Creating vector store... (this may take a few minutes)


In [18]:
# Create vector store with FAISS
# Assuming 'split_docs' is your list of all document chunks
# and 'embeddings' is your configured embedding model

# Create an initial vector store with the first batch
vectorstore = FAISS.from_documents(
    documents=split_docs[0:100],  # Start with the first 100 docs
    embedding=embeddings
)

# Loop through the rest of the documents in smaller batches
for i in range(100, len(split_docs), 100):
    # Add subsequent docs to the existing index
    vectorstore.add_documents(split_docs[i:i+100])
    # Cap the counter so the final, partial batch does not overshoot the total
    print(f"Processed {min(i + 100, len(split_docs))}/{len(split_docs)} documents")

print("✅ Vector store created successfully!")
print(f"🗂️ Indexed {len(split_docs)} document chunks")

Processed 200/63420 documents
Processed 300/63420 documents
Processed 400/63420 documents
Processed 500/63420 documents
Processed 600/63420 documents
Processed 700/63420 documents
Processed 800/63420 documents
Processed 900/63420 documents
Processed 1000/63420 documents
Processed 1100/63420 documents
Processed 1200/63420 documents
Processed 1300/63420 documents
Processed 1400/63420 documents
Processed 1500/63420 documents
Processed 1600/63420 documents
Processed 1700/63420 documents
Processed 1800/63420 documents
Processed 1900/63420 documents
Processed 2000/63420 documents
Processed 2100/63420 documents
Processed 2200/63420 documents
Processed 2300/63420 documents
Processed 2400/63420 documents
Processed 2500/63420 documents
Processed 2600/63420 documents
Processed 2700/63420 documents
Processed 2800/63420 documents
Processed 2900/63420 documents
Processed 3000/63420 documents
Processed 3100/63420 documents
Processed 3200/63420 documents
Processed 3300/63420 documents
Processed 3400/6
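
The indexing loop above feeds the chunks to the embedding model 100 at a time. A minimal sketch of the same batching pattern as a reusable generator, using only the standard library; `iter_batches` is a hypothetical helper and the list of strings stands in for `split_docs`:

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Tuple[int, Sequence[T]]]:
    """Yield (items_done_so_far, batch) pairs that cover `items` in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        yield start + len(batch), batch


fake_chunks = [f"chunk-{n}" for n in range(250)]  # stand-in for split_docs
for done, batch in iter_batches(fake_chunks, 100):
    # In the notebook, this is where vectorstore.add_documents(batch) would go.
    print(f"Processed {done}/{len(fake_chunks)} documents")
```

Because the counter is derived from the actual batch length, the last line reports the true total even when the final batch is short.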

In [27]:
# Save the vector store to disk
vectorstore.save_local("vectorstore/sermons_vectorstore")
print("💾 Vector store saved to disk as 'sermons_vectorstore'")

💾 Vector store saved to disk as 'sermons_vectorstore'


In [32]:
# Load the vector store from disk
vectorstore_loaded = FAISS.load_local(
    "vectorstore/sermons_vectorstore",
    embeddings,
    # The saved store includes a pickle file; only enable this for files you created yourself
    allow_dangerous_deserialization=True,
)
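
Since loading requires `allow_dangerous_deserialization=True`, it can be worth confirming that the saved files have not changed between saving and loading. One possible approach, sketched here with the standard library only, is to record SHA-256 digests after saving and compare them before loading. `directory_digests` is a hypothetical helper, and the temporary folder and file below are stand-ins for the real vector store directory.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def directory_digests(folder: str) -> Dict[str, str]:
    """Map each file name in `folder` to the SHA-256 hex digest of its contents."""
    digests = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        hasher = hashlib.sha256()
        with path.open("rb") as handle:
            # Read in 1 MiB blocks so large index files are not loaded at once
            for block in iter(lambda: handle.read(1 << 20), b""):
                hasher.update(block)
        digests[path.name] = hasher.hexdigest()
    return digests


with tempfile.TemporaryDirectory() as folder:
    saved_file = Path(folder, "index.pkl")       # stand-in for a saved store file
    saved_file.write_bytes(b"original contents")
    recorded = directory_digests(folder)         # record right after saving
    ok_before = directory_digests(folder) == recorded
    saved_file.write_bytes(b"tampered contents")
    ok_after = directory_digests(folder) == recorded

print(ok_before, ok_after)
```

A digest check only detects changes relative to the recorded values; it does not make an untrusted pickle safe to load.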

In [19]:
# Initialize Gemini Flash model
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    temperature=0.0,
    google_api_key=os.environ.get("GOOGLE_API_KEY")
)

print("🤖 Gemini Flash model initialized")

🤖 Gemini Flash model initialized


In [33]:
# Create retriever
retriever = vectorstore_loaded.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # Retrieve top 5 most similar chunks
)

print("🔍 Retriever configured to return top 5 similar chunks")

🔍 Retriever configured to return top 5 similar chunks
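
Conceptually, a `similarity` search with `k=5` embeds the query and returns the stored vectors closest to it. The brute-force sketch below shows that idea on toy 2-D vectors using Euclidean distance. It assumes the store ranks by L2 distance (to my understanding the default for LangChain's FAISS wrapper, but treat that as an assumption); `top_k_nearest` is a hypothetical helper, and FAISS itself is far more efficient than this loop.

```python
import math
from typing import List, Sequence, Tuple


def top_k_nearest(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 5,
) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the `k` vectors closest to `query`."""
    scored = [(idx, math.dist(query, vec)) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]  # stand-in embeddings
nearest = top_k_nearest([0.9, 0.1], toy_vectors, k=2)
print([idx for idx, _ in nearest])
```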


In [34]:
# Define the RAG prompt template
rag_prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant that answers questions based on sermon content from Zac Poonen.

Use the following context from the sermons to answer the question. If the context doesn't contain 
enough information to answer the question, say so honestly.

Context from sermons:
{context}

Question: {question}

Answer: Provide a thoughtful response based on the sermon content. Include relevant Bible verses 
or spiritual insights when mentioned in the context. Be helpful and encouraging in your tone.
""")

print("📝 RAG prompt template created")

📝 RAG prompt template created


In [35]:
# Helper function to format retrieved documents
def format_docs(docs):
    formatted_context = []
    for i, doc in enumerate(docs, 1):
        title = doc.metadata.get('title', 'Unknown Title')
        content = doc.page_content[:500] + "..." if len(doc.page_content) > 500 else doc.page_content
        formatted_context.append(f"Sermon {i}: {title}\n{content}\n")
    return "\n".join(formatted_context)

print("🔧 Document formatting function created")

🔧 Document formatting function created


In [36]:
# Create the RAG chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

print("⛓️ RAG chain created successfully!")
print("🎉 System ready for queries!")

⛓️ RAG chain created successfully!
🎉 System ready for queries!
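
The chain above reads left to right: retrieve and format the context while passing the question through unchanged, fill the prompt, call the model, and return the text. The sketch below spells out that data flow in plain Python with stub components. It is not how LangChain implements the pipe operator; `run_rag` and the `fake_*` functions are hypothetical stand-ins, and the prompt is a shortened version of the template.

```python
from typing import Callable, List

PROMPT = "Context from sermons:\n{context}\n\nQuestion: {question}\n"


def run_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    format_context: Callable[[List[str]], str],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context, build the prompt, and return the model's text."""
    docs = retrieve(question)                      # retriever
    context = format_context(docs)                 # format_docs
    prompt = PROMPT.format(context=context, question=question)  # rag_prompt
    return generate(prompt)                        # llm | StrOutputParser()


def fake_retrieve(question: str) -> List[str]:
    return ["Be content with what you have."]     # stand-in for retrieved chunks


def fake_format(docs: List[str]) -> str:
    return "\n".join(docs)


def fake_generate(prompt: str) -> str:
    return prompt                                  # echo model: returns what it was given


answer = run_rag("What is contentment?", fake_retrieve, fake_format, fake_generate)
print(answer)
```

Echoing the prompt makes it easy to inspect exactly what the model would receive before spending API calls.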


## Query Interface

Now you can ask questions about the sermon content!

In [37]:
# Function to query the RAG system
def query_sermons(question: str, show_sources: bool = True):
    """
    Query the RAG system with a question about the sermons.
    
    Args:
        question (str): The question to ask
        show_sources (bool): Whether to show the source documents
    
    Returns:
        str: The generated answer
    """
    print(f"❓ Question: {question}")
    print("🔍 Searching for relevant content...")
    
    # Get relevant documents (invoke() replaces the deprecated get_relevant_documents())
    relevant_docs = retriever.invoke(question)
    
    if show_sources:
        print("\n📚 Sources found:")
        for i, doc in enumerate(relevant_docs, 1):
            title = doc.metadata.get('title', 'Unknown Title')
            print(f"{i}. {title}")
    
    # Generate answer
    print("\n🤖 Generating answer...")
    answer = rag_chain.invoke(question)
    
    print("\n💬 Answer:")
    print(answer)
    
    return answer

print("✅ Query function ready!")

✅ Query function ready!
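
Because several retrieved chunks can come from the same sermon, the source list can repeat a title. A small sketch of collapsing duplicates while keeping the ranking order; `unique_in_order` is a hypothetical helper that could be applied to the titles before printing them:

```python
from typing import Iterable, List


def unique_in_order(titles: Iterable[str]) -> List[str]:
    """Remove repeated titles while preserving first-seen order."""
    # dict keys keep insertion order, so this keeps the first occurrence of each title
    return list(dict.fromkeys(titles))


retrieved_titles = ["Sermon A", "Sermon B", "Sermon A", "Sermon C", "Sermon B"]  # stand-in data
print(unique_in_order(retrieved_titles))
```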


## Example Queries

Try asking questions about the sermon content:

In [38]:
# Example query 1: About contentment
query_sermons("What does the Bible teach about contentment?")

❓ Question: What does the Bible teach about contentment?
🔍 Searching for relevant content...

📚 Sources found:
1. 6. Never Doubt Christ's Love for You — Zac Poonen — August 25, 2024
2. Be Content with the Way God Made You — Zac Poonen — May 07, 2025
3. 6. Contentment -- Zac Poonen -- March 1, 2009
4. Godliness with Contentment Brings Great Spiritual Profit -- Zac Poonen -- October 2, 2024
5. 6. Never Doubt Christ's Love for You — Zac Poonen — August 25, 2024

🤖 Generating answer...

💬 Answer:
Based on Zac Poonen's sermons, the Bible teaches that contentment is a vital aspect of godliness.  Several sermons emphasize Hebrews 13:5, which encourages us to "make sure your character is free from the love of money; be content with what you have."  This doesn't mean we shouldn't strive for improvement in our circumstances (like seeking a better job if the cost of living increases), but rather that our hearts should not be consumed by a desire for more.  True contentment comes from recognizing 

'Based on Zac Poonen\'s sermons, the Bible teaches that contentment is a vital aspect of godliness.  Several sermons emphasize Hebrews 13:5, which encourages us to "make sure your character is free from the love of money; be content with what you have."  This doesn\'t mean we shouldn\'t strive for improvement in our circumstances (like seeking a better job if the cost of living increases), but rather that our hearts should not be consumed by a desire for more.  True contentment comes from recognizing God\'s provision and being thankful for what He has given, regardless of the amount.\n\nPoonen shares a personal testimony of 56 years of marriage with his wife, where they have consistently been content with their provision, never once asking God for more.  This contentment, he testifies, has brought them great happiness.  He highlights that this contentment is not passive resignation, but an active choice to trust in God\'s provision and care.  He connects contentment to spiritual growth

In [26]:
query_sermons("What age does Zac recommend men and women to consider marriage?")

❓ Question: What age does Zac recommend men and women to consider marriage?
🔍 Searching for relevant content...

📚 Sources found:
1. RLCF Men's Conference 2025 Q&A Session #2 — Zac Poonen — April 19, 2025
2. Global Online Meeting Q&A -- Zac Poonen -- March 13, 2021
3. Questions & Answers - Zac Poonen - September 13, 2014
4. 5. Seeing Jesus Clearly in the Bible — Zac Poonen — August 25, 2024
5. Questions & Answers - Zac Poonen - September 13, 2014

🤖 Generating answer...

💬 Answer:
Based on Zac Poonen's sermons, there's no single recommended age for marriage.  He emphasizes that the decision is deeply personal and should be guided by prayer and surrender to God's will.

Several points emerge from his counsel:

* **Spiritual Readiness:**  He highlights the importance of spiritual maturity before marriage.  In Sermon 5, he advises against marrying too young, explaining that it can hinder the dedicated study of scripture, which is crucial for a strong spiritual life.  He personally married

'Based on Zac Poonen\'s sermons, there\'s no single recommended age for marriage.  He emphasizes that the decision is deeply personal and should be guided by prayer and surrender to God\'s will.\n\nSeveral points emerge from his counsel:\n\n* **Spiritual Readiness:**  He highlights the importance of spiritual maturity before marriage.  In Sermon 5, he advises against marrying too young, explaining that it can hinder the dedicated study of scripture, which is crucial for a strong spiritual life.  He personally married at 28 1/2, after being converted at 19 1/2, suggesting a period of spiritual growth before marriage.\n\n* **Financial Stability:** Sermon 3 mentions 1 Corinthians 7:9 ("better to marry than to burn"), suggesting that being old enough and financially capable of supporting a family is a factor to consider.  This is presented as one way to discern God\'s timing, not as a strict rule.\n\n* **Avoiding Immorality:**  Sermon 3 also references 1 Corinthians 7:2 ("let every man hav

In [None]:
# Example query 2: About marriage and family
query_sermons("What guidance does the Bible provide for marriage and family relationships?")

In [None]:
# Example query 3: About spiritual gifts
query_sermons("How should Christians view their spiritual gifts and talents?")

In [None]:
# Interactive query cell - modify this to ask your own questions
your_question = "What does Zac Poonen teach about godliness?"
query_sermons(your_question)

## Advanced Features

Additional functionality for the RAG system:

In [None]:
# Function to search for specific topics
def search_topic(topic: str, num_results: int = 3):
    """
    Search for sermons containing a specific topic.
    
    Args:
        topic (str): The topic to search for
        num_results (int): Number of results to return
    
    Returns:
        list: List of relevant documents
    """
    print(f"🔍 Searching for topic: '{topic}'")
    
    # Search the same store the retriever uses, so this also works
    # when the index was loaded from disk instead of rebuilt
    docs = vectorstore_loaded.similarity_search(topic, k=num_results)
    
    print(f"\n📋 Found {len(docs)} relevant sermon excerpts:")
    
    for i, doc in enumerate(docs, 1):
        title = doc.metadata.get('title', 'Unknown Title')
        preview = doc.page_content[:200] + "..." if len(doc.page_content) > 200 else doc.page_content
        print(f"\n{i}. **{title}**")
        print(f"   {preview}")
    
    return docs

print("🔍 Topic search function ready!")

In [None]:
# Example topic search
search_topic("faith and trust in God")

In [None]:
# Function to get sermon statistics
def get_sermon_stats():
    """
    Display statistics about the sermon dataset.
    """
    print("📊 Sermon Dataset Statistics:")
    print(f"   • Total sermons: {len(df)}")
    print(f"   • Total words: {df['sermon'].str.split().str.len().sum():,}")
    print(f"   • Average words per sermon: {df['sermon'].str.split().str.len().mean():.0f}")
    print(f"   • Document chunks in vector store: {len(split_docs)}")
    
    # Count Q&A sessions
    qa_count = df['title'].str.contains('Questions & Answers|Q&A', case=False, na=False).sum()
    print(f"   • Q&A sessions: {qa_count}")
    print(f"   • Regular sermons: {len(df) - qa_count}")
    
    print("\n📅 Date range: Based on titles, sermons span multiple years")
    print(f"👨‍🏫 Primary speaker: {df['author'].iloc[0]}")

get_sermon_stats()

## System Information

This RAG system provides:

1. **Knowledge Base**: 417 sermon transcripts from Zac Poonen (about 3.8 million words)
2. **Embeddings**: Google's `models/embedding-001` model for semantic search
3. **Vector Store**: FAISS index over 63,420 chunks (up to 500 characters each, with 200 characters of overlap)
4. **Generation**: Gemini Flash (`gemini-1.5-flash`, temperature 0) for natural language responses
5. **Retrieval**: Top 5 most similar sermon excerpts for each query

### Usage Tips:
- Ask specific questions about biblical topics, spiritual growth, or Christian living
- The system will find relevant sermon content and provide contextual answers
- The prompt asks the model to include Bible verses and spiritual insights when they appear in the retrieved excerpts, and to say so when the context is insufficient
- Use the `search_topic()` function to explore specific themes

### Example Questions:
- "How can I grow in faith?"
- "What does the Bible say about forgiveness?"
- "How should Christians handle trials and difficulties?"
- "What is the role of prayer in a believer's life?"