# Simple PDF Q&A with Vector Database

This notebook provides a streamlined workflow for:

1. Reading PDF files
2. Storing data in vector database
3. Answering questions about the content

Perfect for quick document analysis and Q&A!


## Install Required Packages

Install all necessary libraries for PDF processing and vector search.


In [1]:
# Install required packages
%pip install langchain-community langchain-google-genai faiss-cpu pypdf python-dotenv google-generativeai

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Import Libraries and Setup


In [2]:
# Import required libraries
import os
from dotenv import load_dotenv
from typing import List, Dict, Any

# LangChain imports
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.schema import Document

# Google AI for Gemini
import google.generativeai as genai

print("✓ All libraries imported successfully!")

  from .autonotebook import tqdm as notebook_tqdm


✓ All libraries imported successfully!


## Configure API Key

Set up your Google AI API key for Gemini embeddings and text generation.


In [3]:
# Load environment variables
load_dotenv()

# Configure Google AI API key
# Option 1: Set in .env file as GOOGLE_API_KEY=your-api-key
# Option 2: Uncomment and set directly below
# os.environ["GOOGLE_API_KEY"] = "your-google-ai-api-key-here"

if "GOOGLE_API_KEY" in os.environ:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    print("✓ Google AI API key configured successfully")
else:
    print("⚠️ Please set your GOOGLE_API_KEY")
    print("Get your API key from: https://aistudio.google.com/app/apikey")

✓ Google AI API key configured successfully


## PDF Reader and Vector Database Functions

Core functions for processing PDFs and creating searchable vector databases.


In [4]:
class PDFVectorDB:
    def __init__(self):
        self.embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
        self.vector_store = None
        self.documents = []
        self.current_pdf_path = None
        self.database_created = False
        
    def load_pdf(self, pdf_path: str, chunk_size: int = 300, chunk_overlap: int = 50):
        """
        Load and process a PDF file into searchable chunks.
        
        Args:
            pdf_path: Path to the PDF file
            chunk_size: Size of text chunks (reduced for better granularity)
            chunk_overlap: Overlap between chunks
        """
        # Check if this PDF is already loaded
        if self.current_pdf_path == pdf_path and self.documents:
            print(f"📖 PDF '{pdf_path}' already loaded with {len(self.documents)} chunks")
            return
            
        print(f"📖 Loading PDF: {pdf_path}")
        
        try:
            # Load PDF
            loader = PyPDFLoader(pdf_path)
            documents = loader.load()
            
            print(f"✓ Loaded {len(documents)} pages")
            
            # Preview the raw content
            for i, doc in enumerate(documents):
                content_preview = doc.page_content.replace('\n', ' ')[:200]
                print(f"Page {i+1} content: {content_preview}...")
            
            # Use improved car-specific chunking
            self.documents = self._create_car_specific_chunks(documents)
            self.current_pdf_path = pdf_path
            self.database_created = False  # Reset database status
            
            print(f"✓ Created {len(self.documents)} text chunks")
            
            # Preview chunks with more details
            for i, doc in enumerate(self.documents):
                preview = doc.page_content.replace('\n', ' ')[:150]
                print(f"Chunk {i+1} ({len(doc.page_content)} chars): {preview}...")
                
        except Exception as e:
            print(f"❌ Error loading PDF: {str(e)}")
    
    def _create_car_specific_chunks(self, documents: List[Document]) -> List[Document]:
        """
        Create car-specific chunks that isolate individual car information.
        """
        all_chunks = []
        car_brands = ["Toyota", "Honda", "Ford", "BMW", "Audi", "Mercedes", "Nissan", "Hyundai", "Chevrolet", "Volkswagen"]
        
        for doc in documents:
            content = doc.page_content
            lines = content.split('\n')
            
            # Find car entries using multiple strategies
            car_chunks = []
            current_car_info = []
            
            for line in lines:
                line = line.strip()
                if not line:
                    continue
                
                # Strategy 1: Look for lines that start with car brands
                line_starts_with_brand = any(line.startswith(brand) for brand in car_brands)
                
                # Strategy 2: Look for lines that contain car brands anywhere
                line_contains_brand = any(brand in line for brand in car_brands)
                
                # Strategy 3: Look for car model patterns (Brand + Model + Year)
                words = line.split()
                has_year = any(word.isdigit() and len(word) == 4 and 2000 <= int(word) <= 2025 for word in words)
                
                # If this looks like a new car entry and we have accumulated info, save it
                if (line_starts_with_brand or (line_contains_brand and has_year)) and current_car_info:
                    # Save the previous car info
                    car_text = '\n'.join(current_car_info)
                    if len(car_text.strip()) > 20:
                        chunk_doc = Document(
                            page_content=car_text,
                            metadata={**doc.metadata, 'chunk_type': 'individual_car', 'car_info': True}
                        )
                        car_chunks.append(chunk_doc)
                    current_car_info = []
                
                # Add current line to the accumulating car info
                current_car_info.append(line)
                
                # Also look for additional car details in subsequent lines
                if any(keyword in line.lower() for keyword in ['mileage:', 'color:', 'price:', 'condition:']):
                    current_car_info.append(line)
            
            # Don't forget the last car
            if current_car_info:
                car_text = '\n'.join(current_car_info)
                if len(car_text.strip()) > 20:
                    chunk_doc = Document(
                        page_content=car_text,
                        metadata={**doc.metadata, 'chunk_type': 'individual_car', 'car_info': True}
                    )
                    car_chunks.append(chunk_doc)
            
            # If car-specific chunking produced good results, use them
            if len(car_chunks) >= 2:
                print(f"✓ Successfully created {len(car_chunks)} car-specific chunks")
                all_chunks.extend(car_chunks)
            else:
                # Fallback: Try to parse the content differently
                print("🔄 Trying alternative parsing strategy...")
                alternative_chunks = self._parse_car_inventory_format(content, doc.metadata)
                if alternative_chunks:
                    all_chunks.extend(alternative_chunks)
                else:
                    # Last resort: traditional chunking
                    print("🔄 Using traditional text splitting as fallback...")
                    text_splitter = CharacterTextSplitter(
                        chunk_size=200,  # Smaller chunks for better precision
                        chunk_overlap=20,
                        separator="\n"
                    )
                    traditional_chunks = text_splitter.split_documents([doc])
                    all_chunks.extend(traditional_chunks)
        
        return all_chunks
    
    def _parse_car_inventory_format(self, content: str, metadata: dict) -> List[Document]:
        """
        Alternative parsing strategy for car inventory format.
        """
        chunks = []
        
        # Look for car patterns more aggressively
        lines = content.split('\n')
        car_brands = ["Toyota", "Honda", "Ford", "BMW", "Audi", "Mercedes", "Nissan", "Hyundai"]
        
        # Find all lines that mention car brands
        car_lines = []
        for i, line in enumerate(lines):
            if any(brand in line for brand in car_brands):
                # Collect this line and the next few lines that might contain details
                car_block = [line]
                for j in range(i+1, min(i+6, len(lines))):  # Look ahead up to 5 lines
                    next_line = lines[j].strip()
                    if next_line and any(keyword in next_line.lower() for keyword in ['mileage', 'color', 'price', 'condition', 'km', '$']):
                        car_block.append(next_line)
                    elif any(brand in next_line for brand in car_brands):
                        break  # Stop if we hit another car
                
                if len(car_block) > 1:  # Only add if we found additional details
                    car_text = ' '.join(car_block)
                    chunk_doc = Document(
                        page_content=car_text,
                        metadata={**metadata, 'chunk_type': 'parsed_car', 'car_info': True}
                    )
                    chunks.append(chunk_doc)
        
        # Also try splitting by common separators and filtering for car content
        if not chunks:
            # Try splitting by sentences or common patterns
            import re
            sentences = re.split(r'[.!?]\s+', content)
            for sentence in sentences:
                if any(brand in sentence for brand in car_brands) and len(sentence.strip()) > 30:
                    chunk_doc = Document(
                        page_content=sentence.strip(),
                        metadata={**metadata, 'chunk_type': 'sentence_car', 'car_info': True}
                    )
                    chunks.append(chunk_doc)
        
        return chunks
    
    def create_vector_database(self, save_path: str = "pdf_vector_db"):
        """
        Create vector database from loaded documents.
        
        Args:
            save_path: Path to save the vector database
        """
        if not self.documents:
            print("❌ No documents loaded. Please load a PDF first.")
            return
            
        # Check if database is already created for current documents
        if self.database_created and self.vector_store is not None:
            print(f"✓ Vector database already exists with {len(self.documents)} document chunks")
            return
            
        print("🔄 Creating vector database with Gemini embeddings...")
        
        try:
            # Create FAISS vector store
            self.vector_store = FAISS.from_documents(self.documents, self.embeddings)
            
            # Save to disk
            self.vector_store.save_local(save_path)
            self.database_created = True
            
            print(f"✓ Vector database created and saved to '{save_path}'")
            print(f"📊 Database contains {len(self.documents)} document chunks")
            
            # Show chunk types distribution
            chunk_types = {}
            for doc in self.documents:
                chunk_type = doc.metadata.get('chunk_type', 'standard')
                chunk_types[chunk_type] = chunk_types.get(chunk_type, 0) + 1
            
            print("📊 Chunk types distribution:")
            for chunk_type, count in chunk_types.items():
                print(f"   • {chunk_type}: {count} chunks")
            
        except Exception as e:
            print(f"❌ Error creating vector database: {str(e)}")
    
    def search(self, query: str, k: int = 3) -> List[Document]:
        """
        Search the vector database for relevant content with enhanced debugging.
        
        Args:
            query: Search query
            k: Number of results to return
            
        Returns:
            List of relevant documents
        """
        if not self.vector_store:
            print("❌ No vector database found. Please create one first.")
            return []
            
        print(f"🔍 Searching for: '{query}'")
        print(f"📊 Database contains {len(self.documents)} total chunks")
        
        try:
            # Search with scores
            results_with_scores = self.vector_store.similarity_search_with_score(query, k=k)
            
            print(f"✓ Found {len(results_with_scores)} relevant results:")
            print("=" * 80)
            
            results = []
            for i, (doc, score) in enumerate(results_with_scores):
                print(f"📄 Result {i+1} (Similarity Score: {score:.4f}):")
                
                # Clean and format content for better readability
                content = doc.page_content.strip()
                if len(content) > 300:
                    content = content[:300] + "..."
                
                print(f"Content: {content}")
                print(f"📍 Source: Page {doc.metadata.get('page', 'Unknown')}")
                print(f"🏷️ Chunk Type: {doc.metadata.get('chunk_type', 'standard')}")
                print(f"🚗 Car Info: {'Yes' if doc.metadata.get('car_info') else 'No'}")
                
                # Check query term presence
                query_lower = query.lower()
                content_lower = doc.page_content.lower()
                if query_lower in content_lower:
                    print(f"✅ Query term '{query}' found in content")
                else:
                    print(f"🔍 Semantic match (no direct text match)")
                
                print("-" * 80)
                results.append(doc)
            
            if not results:
                print("ℹ️ No results found. Trying diagnostic search...")
                self._debug_search(query)
            
            return results
            
        except Exception as e:
            print(f"❌ Error during search: {str(e)}")
            return []
    
    def _debug_search(self, query: str):
        """
        Debug helper to understand why search might not be working.
        """
        print("\n🔧 Search Debug Information:")
        print(f"   • Query: '{query}'")
        print(f"   • Total chunks in database: {len(self.documents)}")
        
        # Check if query terms appear in any chunks
        query_lower = query.lower()
        matching_chunks = []
        
        for i, doc in enumerate(self.documents):
            if query_lower in doc.page_content.lower():
                matching_chunks.append((i, doc))
        
        print(f"   • Chunks containing '{query}': {len(matching_chunks)}")
        
        if matching_chunks:
            print("   • Direct matches found in:")
            for i, (chunk_idx, doc) in enumerate(matching_chunks[:3]):
                preview = doc.page_content.replace('\n', ' ')[:100]
                print(f"     - Chunk {chunk_idx + 1}: {preview}...")
        else:
            print("   • No direct text matches found")
            print("   • Sample of available content:")
            for i, doc in enumerate(self.documents[:3]):
                preview = doc.page_content.replace('\n', ' ')[:100]
                chunk_type = doc.metadata.get('chunk_type', 'standard')
                print(f"     - Chunk {i + 1} ({chunk_type}): {preview}...")
    
    def answer_question(self, question: str) -> str:
        """
        Answer a question using Gemini based on the document content.
        
        Args:
            question: Question to answer
            
        Returns:
            AI-generated answer
        """
        if not self.vector_store:
            return "❌ No vector database available. Please load and process a PDF first."
        
        print(f"❓ Question: {question}")
        
        try:
            # Get relevant context with more results for better coverage
            relevant_docs = self.vector_store.similarity_search(question, k=5)
            context = "\n\n".join([doc.page_content for doc in relevant_docs])
            
            print(f"📄 Using {len(relevant_docs)} relevant document chunks for context")
            
            # Use Gemini to generate answer
            model = genai.GenerativeModel('gemini-2.0-flash-exp')
            
            prompt = f"""
            Based on the following document content, please answer the question accurately and concisely.
            If the information is not available in the context, please say so.
            
            Context:
            {context}
            
            Question: {question}
            
            Answer:
            """
            
            response = model.generate_content(prompt)
            answer = response.text
            
            print(f"🤖 Answer: {answer}")
            return answer
            
        except Exception as e:
            error_msg = f"Error generating answer: {str(e)}"
            print(f"❌ {error_msg}")
            return error_msg
    
    def load_existing_database(self, db_path: str):
        """
        Load an existing vector database from disk.
        
        Args:
            db_path: Path to the saved vector database
        """
        try:
            self.vector_store = FAISS.load_local(
                db_path, 
                self.embeddings, 
                allow_dangerous_deserialization=True
            )
            self.database_created = True
            print(f"✓ Loaded existing vector database from '{db_path}'")
        except Exception as e:
            print(f"❌ Failed to load database: {str(e)}")
    
    def get_status(self):
        """
        Get current status of the PDF Vector Database.
        """
        print("📊 PDF Vector Database Status:")
        print(f"   • Current PDF: {self.current_pdf_path or 'None'}")
        print(f"   • Documents loaded: {len(self.documents)}")
        print(f"   • Vector database: {'✓ Created' if self.database_created else '❌ Not created'}")
        
        if self.documents:
            # Show chunk types
            chunk_types = {}
            for doc in self.documents:
                chunk_type = doc.metadata.get('chunk_type', 'standard')
                chunk_types[chunk_type] = chunk_types.get(chunk_type, 0) + 1
            
            print("   • Chunk types:")
            for chunk_type, count in chunk_types.items():
                print(f"     - {chunk_type}: {count} chunks")
                
            print("   • Sample chunks:")
            for i, doc in enumerate(self.documents[:3]):
                preview = doc.page_content.replace('\n', ' ')[:80]
                chunk_type = doc.metadata.get('chunk_type', 'standard')
                print(f"     - Chunk {i+1} ({chunk_type}): {preview}...")

# Initialize the PDF Vector Database
pdf_db = PDFVectorDB()
print("✓ PDF Vector Database initialized with improved chunking!")

✓ PDF Vector Database initialized with improved chunking!


## Load and Process Your PDF

Upload your PDF and create a searchable vector database.


In [17]:
# Load PDF file
pdf_path = "carData.pdf"  # Change this to your PDF path

# Check current status
pdf_db.get_status()

# Load and process the PDF (will skip if already loaded)
pdf_db.load_pdf(pdf_path)

# Create vector database (will skip if already created)
pdf_db.create_vector_database()

📊 PDF Vector Database Status:
   • Current PDF: carData.pdf
   • Documents loaded: 4
   • Vector database: ✓ Created
   • Chunk types:
     - individual_car: 4 chunks
   • Sample chunks:
     - Chunk 1 (individual_car): Toyota Corolla 2020 Mileage: 20,000 km Mileage: 20,000 km Color: Blue Color: Blu...
     - Chunk 2 (individual_car): Honda Civic 2019 Mileage: 35,000 km Mileage: 35,000 km Color: Red Color: Red Pri...
     - Chunk 3 (individual_car): Ford Focus 2021 Mileage: 15,000 km Mileage: 15,000 km Color: White Color: White ...
📖 PDF 'carData.pdf' already loaded with 4 chunks
✓ Vector database already exists with 4 document chunks


In [8]:
# Reset the database and reload with improved chunking
print("🔄 Resetting database for improved chunking...")

# Create a fresh instance
pdf_db = PDFVectorDB()
print("✓ Fresh PDF Vector Database instance created")

# Reload the PDF with better chunking
pdf_path = "carData.pdf"  # Make sure this matches your PDF file

# Load and process with improved chunking
pdf_db.load_pdf(pdf_path)
pdf_db.create_vector_database()

print("🎉 Database reset and reloaded with improved car-specific chunking!")

🔄 Resetting database for improved chunking...
✓ Fresh PDF Vector Database instance created
📖 Loading PDF: carData.pdf
✓ Loaded 1 pages
Page 1 content: Car Inventory Report Toyota Corolla 2020 Mileage: 20,000 km Color: Blue Price: $18,500 Condition: Excellent Honda Civic 2019 Mileage: 35,000 km Color: Red Price: $16,800 Condition: Good Ford Focus 202...
✓ Successfully created 4 car-specific chunks
✓ Created 4 text chunks
Chunk 1 (153 chars): Toyota Corolla 2020 Mileage: 20,000 km Mileage: 20,000 km Color: Blue Color: Blue Price: $18,500 Price: $18,500 Condition: Excellent Condition: Excell...
Chunk 2 (138 chars): Honda Civic 2019 Mileage: 35,000 km Mileage: 35,000 km Color: Red Color: Red Price: $16,800 Price: $16,800 Condition: Good Condition: Good...
Chunk 3 (149 chars): Ford Focus 2021 Mileage: 15,000 km Mileage: 15,000 km Color: White Color: White Price: $19,200 Price: $19,200 Condition: Like New Condition: Like New...
Chunk 4 (149 chars): BMW 320i 2018 Mileage: 45,000 km Mileage: 4

## Search and Ask Questions

Now you can search for specific information or ask questions about your document.


In [None]:
# Example: Ask a question and get an AI-generated answer
answer = pdf_db.answer_question("What is the mileage of BMW 320i 2018?")

❓ Question: What is the mileage of BMW 320i 2018?
📄 Using 4 relevant document chunks for context
🤖 Answer: 45,000 km



In [12]:
# Example: Search for specific content
print("=== Testing Search Functionality ===")
search_results = pdf_db.search("Toyota Corolla", k=1)

=== Testing Search Functionality ===
🔍 Searching for: 'Toyota Corolla'
📊 Database contains 4 total chunks
✓ Found 1 relevant results:
📄 Result 1 (Similarity Score: 0.5406):
Content: Toyota Corolla 2020
Mileage: 20,000 km
Mileage: 20,000 km
Color: Blue
Color: Blue
Price: $18,500
Price: $18,500
Condition: Excellent
Condition: Excellent
📍 Source: Page 0
🏷️ Chunk Type: individual_car
🚗 Car Info: Yes
✅ Query term 'Toyota Corolla' found in content
--------------------------------------------------------------------------------


In [13]:
# Example: Ask another question
answer2 = pdf_db.answer_question("What cars are available and their prices?")

❓ Question: What cars are available and their prices?
📄 Using 4 relevant document chunks for context
🤖 Answer: *   Toyota Corolla 2020: $18,500
*   Honda Civic 2019: $16,800
*   BMW 320i 2018: $24,900
*   Ford Focus 2021: $19,200



## Interactive Q&A Section

Use this section to ask your own questions about the document.


In [14]:
# Ask your own questions here!
# Simply change the question below and run the cell

my_question = "What is the price and condition of Honda Civic?"
my_answer = pdf_db.answer_question(my_question)

❓ Question: What is the price and condition of Honda Civic?
📄 Using 4 relevant document chunks for context
🤖 Answer: Price: $16,800
Condition: Good



In [None]:
# Comprehensive Search Testing with Improved Chunking
print("=== Testing Improved Search Functionality ===")
print("🎯 Testing with better car-specific chunks for precise results\n")

# Search for BMW (should find BMW-specific chunk)
print("1️⃣ Searching for 'BMW':")
search_results_bmw = pdf_db.search("BMW", k=1)
print()

print("🎯 Search testing completed! Each search should now return different, relevant chunks.")

=== Testing Improved Search Functionality ===
🎯 Testing with better car-specific chunks for precise results

1️⃣ Searching for 'BMW':
🔍 Searching for: 'BMW'
📊 Database contains 4 total chunks
✓ Found 1 relevant results:
📄 Result 1 (Similarity Score: 0.7498):
Content: BMW 320i 2018
Mileage: 45,000 km
Mileage: 45,000 km
Color: Black
Color: Black
Price: $24,900
Price: $24,900
Condition: Very Good
Condition: Very Good
📍 Source: Page 0
🏷️ Chunk Type: individual_car
🚗 Car Info: Yes
✅ Query term 'BMW' found in content
--------------------------------------------------------------------------------

🎯 Search testing completed! Each search should now return different, relevant chunks.


## Save/Load Database

Save your vector database for future use or load an existing one.


In [18]:
# Save the current vector database
# (This is automatically done when creating the database, but you can save with a custom name)
# pdf_db.vector_store.save_local("my_custom_db_name")

# Load an existing vector database
# pdf_db.load_existing_database("my_custom_db_name")

print("💾 Database save/load functions available!")
print("Uncomment the lines above to save or load a custom database.")

💾 Database save/load functions available!
Uncomment the lines above to save or load a custom database.
