# NotebookLM Clone: RAG Implementation with Gemini

This notebook demonstrates how the RAG (Retrieval-Augmented Generation) system works behind the scenes in our NotebookLM Clone application. We'll walk through its core components step by step: chunking, embedding, retrieval, and answer generation.

## 1. Setup and Configuration

First, let's install and import all the necessary libraries.

In [None]:
# Install required packages if not already installed
!pip install google-generativeai PyPDF2 langchain langchain-community faiss-cpu sentence-transformers numpy python-dotenv

In [None]:
import os
import sys
import PyPDF2
import google.generativeai as genai
from dotenv import load_dotenv
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
import numpy as np

# Load environment variables from .env file
load_dotenv()

# Configure Gemini API
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
if not GEMINI_API_KEY:
    # Raise instead of calling sys.exit(), which is awkward inside a notebook kernel
    raise RuntimeError(
        "GEMINI_API_KEY not found in environment variables. "
        "Please create a .env file with your Gemini API key."
    )
    
genai.configure(api_key=GEMINI_API_KEY)
model = genai.GenerativeModel('gemini-1.5-pro')

print("Setup complete!")

## 2. Creating the RAG System

Now let's implement our RAG system class that handles document processing and querying.

In [None]:
class RAGSystem:
    def __init__(self):
        # Initialize sentence transformer for embeddings
        print("Loading embedding model...")
        self.embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
        
        # Initialize text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,  # Maximum characters per chunk
            chunk_overlap=100,  # Overlap between chunks to maintain context
            separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]  # Priority order for splitting
        )
        
        self.vector_store = None
        self.documents = []
        
        print("RAG system initialized!")
    
    def process_document(self, text, document_name="document"):
        print(f"Processing document: {document_name}")
        
        # Split text into chunks
        print("Splitting text into chunks...")
        chunks = self.text_splitter.split_text(text)
        print(f"Created {len(chunks)} chunks")
        
        # Store original documents for reference
        self.documents.append({
            "name": document_name,
            "text": text,
            "chunks": chunks
        })
        
        # Create or update vector store
        print("Creating vector embeddings...")
        if self.vector_store is None:
            self.vector_store = FAISS.from_texts(chunks, self.embeddings)
        else:
            # Add new documents to existing vector store
            self.vector_store.add_texts(chunks)
            
        print("Document processed successfully!")
        return len(chunks)
    
    def extract_text_from_pdf(self, pdf_path):
        print(f"Extracting text from PDF: {pdf_path}")
        text = ""
        
        with open(pdf_path, 'rb') as pdf_file:
            pdf_reader = PyPDF2.PdfReader(pdf_file)
            # Record the page count while the file is still open
            num_pages = len(pdf_reader.pages)
            for page in pdf_reader.pages:
                # Fall back to an empty string if a page yields no text
                page_text = page.extract_text() or ""
                text += page_text + "\n\n"
                
        print(f"Extracted {len(text)} characters from {num_pages} pages")
        return text
    
    def query(self, question, k=3):
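        if self.vector_store is None:
            # Return the same dict shape as a successful query so callers can index it
            return {
                "answer": "Please process a document first.",
                "context": "",
                "chunks_used": []
            }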
        
        print(f"Querying: {question}")
        
        # Search for relevant documents
        print(f"Finding top {k} relevant chunks...")
        docs = self.vector_store.similarity_search(question, k=k)
        
        # Create context from relevant documents
        context = "\n\n---\n\n".join([doc.page_content for doc in docs])
        
        print("Generating response using Gemini...")
        # Generate response using Gemini
        prompt = f"""
        Answer the question based on the following context. If the answer is not in the context, just say that you don't know.
        
        Context:
        {context}
        
        Question: {question}
        
        Answer:
        """
        
        response = model.generate_content(prompt)
        
        return {
            "answer": response.text,
            "context": context,
            "chunks_used": [doc.page_content for doc in docs]
        }


# Create an instance of our RAG system
rag = RAGSystem()
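
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of overlapping chunking. It is a simplified illustration, not what `RecursiveCharacterTextSplitter` does internally: the hypothetical `chunk_text` helper cuts purely by character count and ignores the separator list.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap.

    Illustration only: it ignores separators and cuts by character count.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


demo_chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is how overlap carries context across chunk boundaries.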

## 3. Processing and Querying a Sample Document

Let's process a sample text document and try to query it:

In [None]:
# Sample text about artificial intelligence
sample_text = """
# Introduction to Artificial Intelligence

Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to natural intelligence displayed by humans or animals. Leading AI textbooks define the field as the study of "intelligent agents": any system that perceives its environment and takes actions that maximize its chance of achieving its goals.

## History of AI

The field of AI research was founded at a workshop held on the campus of Dartmouth College in the summer of 1956. The attendees, including John McCarthy, Marvin Minsky, Allen Newell, and Herbert Simon, became the founders and leaders of AI research. They and their students produced programs that were described as "astonishing": computers were learning checkers strategies, solving word problems in algebra, proving logical theorems, and speaking English.

By the middle of the 1960s, research in the U.S. was heavily funded by the Department of Defense and laboratories had been established around the world. AI's founders were optimistic about the future: Herbert Simon predicted, "machines will be capable, within twenty years, of doing any work a man can do."

## AI Approaches

### Machine Learning
Machine learning (ML) is a subset of AI that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. ML focuses on the development of computer programs that can access data and use it to learn for themselves.

Key machine learning algorithms include:
1. Supervised learning (classification, regression)
2. Unsupervised learning (clustering, dimensionality reduction)
3. Reinforcement learning

### Deep Learning
Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, and more.

## AI Applications

### Healthcare
AI in healthcare is used for tasks such as diagnosis of diseases, drug discovery, and personalized medicine. IBM's Watson for Oncology is trained to help doctors treat cancer patients by analyzing patient data and medical literature.

### Finance
AI is revolutionizing the finance industry through algorithmic trading, fraud detection, and customer service chatbots. JP Morgan's COIN program interprets commercial loan agreements in seconds, a task that previously took 360,000 hours of work annually by lawyers.

### Transportation
Self-driving vehicles use various AI techniques including computer vision and decision-making algorithms. Companies like Tesla, Waymo, and many traditional automakers are working on autonomous vehicles.

## Ethical Considerations

The development of AI raises ethical concerns related to privacy, security, and potential job displacement. Issues such as algorithmic bias, accountability, and the long-term impact of AI on human society are actively discussed by researchers, policymakers, and industry leaders.

## Future of AI

While narrow AI is focused on specific tasks, the long-term goal of many researchers is to create artificial general intelligence (AGI) - AI that can perform any intellectual task that a human can. However, experts have varying opinions on when, if ever, AGI will be achieved.
"""

# Process the sample text
chunk_count = rag.process_document(sample_text, "AI_Introduction")
print(f"\nDocument processed into {chunk_count} chunks.")

Now let's ask some questions about the document:

In [None]:
# Ask a question about the content
question1 = "When was AI research founded?"
result1 = rag.query(question1)

print("\nQuestion:", question1)
print("\nAnswer:")
print(result1["answer"])

print("\nRelevant chunks used:")
for i, chunk in enumerate(result1["chunks_used"]):
    print(f"\nChunk {i+1}:\n{chunk}")

In [None]:
# Ask another question
question2 = "What are the key types of machine learning algorithms?"
result2 = rag.query(question2)

print("\nQuestion:", question2)
print("\nAnswer:")
print(result2["answer"])

print("\nRelevant chunks used:")
for i, chunk in enumerate(result2["chunks_used"]):
    print(f"\nChunk {i+1}:\n{chunk}")

In [None]:
# Ask a question whose answer is not in the document
question3 = "What is the capital of France?"
result3 = rag.query(question3)

print("\nQuestion:", question3)
print("\nAnswer:")
print(result3["answer"])

print("\nRelevant chunks used:")
for i, chunk in enumerate(result3["chunks_used"]):
    print(f"\nChunk {i+1}:\n{chunk}")
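
Similarity search returns the `k` closest chunks even when none of them is relevant, so the out-of-scope question above is handled only by the instruction in the prompt. The sketch below shows how such a prompt can be assembled on its own, without calling Gemini. The `build_prompt` helper is hypothetical; it mirrors the wording and the `---` separator used in `RAGSystem.query` but drops the leading indentation.

```python
def build_prompt(chunks, question):
    """Assemble a grounded prompt from retrieved chunks and a question."""
    # Join chunks with the same separator that RAGSystem.query uses
    context = "\n\n---\n\n".join(chunks)
    return (
        "Answer the question based on the following context. "
        "If the answer is not in the context, just say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


# Two short excerpts from the sample text stand in for retrieved chunks
demo_prompt = build_prompt(
    [
        "The field of AI research was founded at a workshop held on the campus of Dartmouth College in the summer of 1956.",
        "Machine learning (ML) is a subset of AI.",
    ],
    "What is the capital of France?",
)
print(demo_prompt)
```

Printing the prompt before sending it is a quick way to check what the model actually sees when an answer looks wrong.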

## 4. Advanced: System Performance Analysis

Let's analyze how chunk size and chunk overlap affect the number of chunks, the chunks retrieved, and the resulting answer:

In [None]:
def test_chunk_sizes(text, question, chunk_sizes=(500, 1000, 2000), overlaps=(50, 100, 200)):
    results = {}
    
    for size in chunk_sizes:
        for overlap in overlaps:
            print(f"\nTesting chunk_size={size}, overlap={overlap}")
            
            # Create new RAG system with specific parameters
            test_rag = RAGSystem()
            test_rag.text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=size,
                chunk_overlap=overlap
            )
            
            # Process document and query
            num_chunks = test_rag.process_document(text, f"chunk_{size}_{overlap}")
            result = test_rag.query(question)
            
            # Store results
            results[f"size_{size}_overlap_{overlap}"] = {
                "num_chunks": num_chunks,
                "answer": result["answer"],
                "chunks_used": result["chunks_used"]
            }
    
    return results

# We'll test this function with a specific question
performance_q = "What are the applications of AI in healthcare?"
# Uncomment to run the test (slow: 9 configurations, each reloading the embedding model and calling Gemini)
# performance_results = test_chunk_sizes(sample_text, performance_q)
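
The dictionary returned by `test_chunk_sizes` is easier to compare once it is condensed into one row per configuration. Here is a minimal sketch, assuming the result shape built above (`num_chunks`, `answer`, `chunks_used`). The `summarize_results` helper is hypothetical, and the input below is hand-written placeholder data, not real output from the system.

```python
def summarize_results(results, keyword):
    """Return one summary row per configuration in a test_chunk_sizes-style dict."""
    rows = []
    for config, info in results.items():
        # Crude retrieval check: did any retrieved chunk mention the keyword?
        hit = any(keyword.lower() in chunk.lower() for chunk in info["chunks_used"])
        rows.append({
            "config": config,
            "num_chunks": info["num_chunks"],
            "keyword_retrieved": hit,
            "answer_chars": len(info["answer"]),
        })
    return rows


# Placeholder data for illustration only (values are made up)
placeholder_results = {
    "size_A": {
        "num_chunks": 4,
        "answer": "placeholder answer",
        "chunks_used": ["AI in healthcare is used for tasks such as diagnosis of diseases."],
    },
    "size_B": {
        "num_chunks": 2,
        "answer": "placeholder",
        "chunks_used": ["Self-driving vehicles use various AI techniques."],
    },
}

for row in summarize_results(placeholder_results, "healthcare"):
    print(row)
```

A keyword check like this is only a rough proxy for retrieval quality, but it quickly shows which configurations failed to surface the relevant section at all.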

## 5. Conclusion

This notebook demonstrates the core RAG system used in our NotebookLM Clone application. The key components are:

1. **Document processing**: Converting text into manageable chunks
2. **Vector embeddings**: Creating numeric representations of text for semantic search
3. **Similarity search**: Finding the most relevant chunks when a question is asked
4. **Generative AI**: Using Gemini to generate answers based on the retrieved context
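
The embedding and similarity-search steps above can be illustrated without any model at all. The sketch below is a toy under stated assumptions: `embed` builds a bag-of-words count vector rather than a MiniLM sentence embedding, and chunks are ranked by cosine similarity, which may differ from the distance metric the FAISS store uses. All three function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (an assumption, not a real sentence embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(question, chunks, k=3):
    """Return the k chunks most similar to the question, best first."""
    q_vec = embed(question)
    scored = [(cosine_similarity(q_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


toy_chunks = [
    "machine learning is a subset of ai",
    "self-driving vehicles use computer vision",
    "ai in healthcare supports diagnosis",
]
print(top_k("what is machine learning", toy_chunks, k=1))
# ['machine learning is a subset of ai']
```

Real sentence embeddings capture meaning beyond shared words, but the ranking logic is the same idea: embed the question, score every chunk, and keep the best `k`.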

The web application wraps this functionality in a user-friendly interface, allowing users to upload documents and interact with them through natural language questions.