# 🎓 Learning GenAI: Building a Semantic Search Engine
## *From GCP Data Engineer to GenAI Developer*

Welcome to your GenAI learning journey! This notebook will teach you how to build a **Semantic Search Engine** step by step.

### 🧠 **What You'll Learn:**
1. **RAG (Retrieval-Augmented Generation)** - Grounding an LLM's answers in your own documents
2. **Vector Embeddings** - Converting text into mathematical representations
3. **Semantic Search** - Finding meaning, not just keywords
4. **LangChain Framework** - The toolkit for GenAI applications
5. **Document Processing** - From raw files to AI-ready chunks

### 🏗️ **What We're Building:**
A system that can:
- 📚 Read your documents from GCS (like your data pipelines!)
- 🧮 Convert text into vectors (embeddings)
- 🔍 Find relevant information semantically 
- 🤖 Answer questions using Gemini 2.5 Pro
- 📝 Show sources (like good data lineage!)

### 💡 **Key Concepts We'll Explore:**
- **Embeddings**: Text → Numbers that capture meaning
- **Vector Database**: Storage optimized for similarity search
- **Chunking**: Breaking documents into digestible pieces
- **Retrieval**: Finding relevant context for questions
- **Generation**: Using LLM to create answers from context


## 📦 Step 1: Understanding & Installing Dependencies

Before we build our GenAI application, let's understand what each tool does:

### 🔧 **Core GenAI Stack:**
- **`langchain`**: The main framework - think of it as pandas for GenAI
- **`langchain-google-vertexai`**: Connects LangChain to Google's AI models
- **`sentence-transformers`**: Creates embeddings (text → vectors)
- **`chromadb`**: Vector database for similarity search
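- **`PyPDF2`, `python-docx`**: Document parsers (like reading CSV/JSON files). Note that LangChain's `PyPDFLoader` and `Docx2txtLoader` import `pypdf` and `docx2txt` when they load a file, so those two packages need to be installed as well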

### 🤔 **Why These Specific Packages?**
- **LangChain**: Simplifies complex GenAI workflows (like Apache Beam for data)
- **Embeddings**: Convert text to numbers so computers can understand similarity
- **Vector DB**: Specialized storage for finding "similar" items quickly
- **Document Loaders**: Handle different file formats automatically
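Before installing, it can help to see which of these packages are already present in your environment. The sketch below uses only the standard library; the `PACKAGES` list and the `installed_versions` helper are illustrative names, not part of this notebook's code.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (taken from the stack listed above).
PACKAGES = ["langchain", "langchain-google-vertexai", "sentence-transformers", "chromadb"]

def installed_versions(packages):
    """Return {package: version string, or None when the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this again after the install cell is a quick way to confirm what actually got installed.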

Let's install everything with compatible versions:


In [5]:
# 📦 Installing the GenAI Toolkit
print("🔧 Installing packages for our GenAI application...")
print("This is like setting up your data engineering environment, but for AI!")

# Core LangChain framework and Google integrations
%pip install -qU langchain langchain-google-vertexai langchain-community langchain-text-splitters

# Vector storage and embeddings
%pip install -qU "google-cloud-storage<3.0.0,>=2.18.0" chromadb sentence-transformers

# Document processing utilities (pypdf and docx2txt are what PyPDFLoader and Docx2txtLoader import)
%pip install -qU python-dotenv PyPDF2 pypdf python-docx docx2txt unstructured tiktoken faiss-cpu

print("\n✅ Installation complete!")
print("🎉 You now have a complete GenAI development environment!")
print("\n📚 What we just installed:")
print("   • LangChain: GenAI application framework")
print("   • ChromaDB: Vector database for semantic search") 
print("   • Sentence Transformers: Text embedding models")
print("   • Document parsers: PDF, DOCX, TXT support")
print("   • GCS integration: Connect to your cloud storage")


🔧 Installing packages for our GenAI application...
This is like setting up your data engineering environment, but for AI!

Note: you may need to restart the kernel to use updated packages.

In [6]:
# 📚 Importing Our GenAI Toolkit
print("📚 Importing libraries - let me explain what each does...")

# Standard Python libraries (you know these!)
import os
import tempfile
from typing import List, Dict, Any
from pathlib import Path
print("✅ Standard Python utilities imported")

# 🦜 LangChain Core Components
print("\n🦜 Importing LangChain components:")
from langchain.chat_models import init_chat_model  # Initialize AI models
print("   • init_chat_model: Connects to Gemini 2.5 Pro")

from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader
print("   • Document loaders: Read PDF, DOCX, TXT files")

from langchain_text_splitters import RecursiveCharacterTextSplitter
print("   • Text splitter: Breaks documents into chunks")

from langchain_community.embeddings import HuggingFaceEmbeddings
print("   • Embeddings: Convert text to vectors")

from langchain_community.vectorstores import Chroma
print("   • ChromaDB: Vector database for similarity search")

from langchain.chains import RetrievalQA
print("   • RetrievalQA: Combines search + question answering")

from langchain_core.prompts import PromptTemplate
print("   • PromptTemplate: Structure how we ask the AI")

# ☁️ Google Cloud Integration
print("\n☁️ Google Cloud components:")
from google.cloud import storage
print("   • GCS Storage: Access your cloud documents")

# Clean up warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("\n🎉 All imports successful!")
print("💡 Think of LangChain as your 'pandas for GenAI' - it handles the complex stuff!")


📚 Importing libraries - let me explain what each does...
✅ Standard Python utilities imported

🦜 Importing LangChain components:
   • init_chat_model: Connects to Gemini 2.5 Pro
   • Document loaders: Read PDF, DOCX, TXT files
   • Text splitter: Breaks documents into chunks
   • Embeddings: Convert text to vectors
   • ChromaDB: Vector database for similarity search
   • RetrievalQA: Combines search + question answering
   • PromptTemplate: Structure how we ask the AI

☁️ Google Cloud components:
   • GCS Storage: Access your cloud documents

🎉 All imports successful!
💡 Think of LangChain as your 'pandas for GenAI' - it handles the complex stuff!


## ⚙️ Step 2: Configuration - Setting Up Your GenAI Pipeline

Just like configuring a data pipeline, we need to set parameters for our GenAI application.

### 🎯 **Configuration Concepts:**
- **Chunk Size**: How big should text pieces be? (like batch size in data processing)
- **Overlap**: How much text to share between chunks (ensures continuity)
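- **Model Temperature**: How much randomness to allow when the model samples its response (0 = most deterministic, higher = more varied). A low temperature reduces variation; it does not by itself guarantee factual answers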
- **Similarity Search K**: How many relevant documents to retrieve?
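To make the temperature setting less abstract, here is a minimal sketch of the textbook "softmax with temperature" formula. How Gemini applies temperature internally is not shown in this notebook, so treat this as an illustration of the general idea; the scores in `toy_logits` are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all probability on the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability spreads from the top-scoring token to the others, which is why higher settings produce more varied output.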

### 🔧 **Think of This As:**
- Spark configuration for distributed processing
- Database connection settings  
- ETL pipeline parameters

Let's configure our system:


In [7]:
# ⚙️ GenAI Configuration Class
print("⚙️ Setting up configuration - like your data pipeline configs!")

class Config:
    """Configuration for our GenAI Semantic Search Engine"""
    
    # 🏗️ Infrastructure Settings (like your GCP project setup)
    print("🏗️ Infrastructure configuration...")
    GCS_BUCKET_NAME = "genai-sai-bucket"  # 🪣 Your data storage
    GOOGLE_CLOUD_PROJECT = "igneous-future-451513-v8d"  # 🌍 Your GCP project
    GCS_FOLDER_PATH = ""  # 📁 Optional: specific folder (like partitions!)
    
    # 🤖 AI Model Settings
    print("🤖 AI model configuration...")
    MODEL_NAME = "gemini-2.5-pro"  # The LLM brain
    MODEL_PROVIDER = "google_vertexai"  # Google's AI platform
    TEMPERATURE = 0.0  # 🌡️ 0 = most deterministic, higher = more varied (like randomness in ML)
    
    # 📄 Document Processing (like ETL transformations)
    print("📄 Document processing configuration...")
    CHUNK_SIZE = 1000  # 📏 Characters per chunk (like batch size)
    CHUNK_OVERLAP = 200  # 🔄 Overlap between chunks (ensures continuity)
    
    # 🗄️ Vector Database Settings (like your data warehouse config)
    print("🗄️ Vector storage configuration...")
    VECTOR_STORE_PERSIST_DIRECTORY = "./vector_store"  # 💾 Local storage
    VECTOR_STORE_COLLECTION_NAME = "documents"  # 🏷️ Table name equivalent
    
    # 🧮 Embedding Model (converts text → numbers)
    print("🧮 Embedding model configuration...")
    EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # Lightweight & fast
    
    # 🔍 Search Parameters
    print("🔍 Search configuration...")
    SIMILARITY_SEARCH_K = 4  # How many relevant docs to retrieve
    SIMILARITY_SCORE_THRESHOLD = 0.7  # Minimum relevance score (the "similarity" retriever configured in Step 4 does not apply it)

# Initialize our configuration
config = Config()

print("\n✅ Configuration Complete!")
print(f"📊 Configuration Summary:")
print(f"   🪣 Data Source: {config.GCS_BUCKET_NAME}")
print(f"   🤖 AI Model: {config.MODEL_NAME}")
print(f"   🧮 Embedding Model: {config.EMBEDDING_MODEL}")
print(f"   📏 Chunk Size: {config.CHUNK_SIZE} characters")
print(f"   🔍 Retrieval Count: {config.SIMILARITY_SEARCH_K} documents")

print(f"\n💡 Pro Tip: These settings affect performance and accuracy!")
print(f"   • Larger chunks = more context but slower processing")
print(f"   • Higher K = more context but potential noise")
print(f"   • Temperature 0 = factual, higher = creative")


⚙️ Setting up configuration - like your data pipeline configs!
🏗️ Infrastructure configuration...
🤖 AI model configuration...
📄 Document processing configuration...
🗄️ Vector storage configuration...
🧮 Embedding model configuration...
🔍 Search configuration...

✅ Configuration Complete!
📊 Configuration Summary:
   🪣 Data Source: genai-sai-bucket
   🤖 AI Model: gemini-2.5-pro
   🧮 Embedding Model: sentence-transformers/all-MiniLM-L6-v2
   📏 Chunk Size: 1000 characters
   🔍 Retrieval Count: 4 documents

💡 Pro Tip: These settings affect performance and accuracy!
   • Larger chunks = more context but slower processing
   • Higher K = more context but potential noise
   • Temperature 0 = factual, higher = creative


## 📚 Step 3: Building a Document Loader - Your First GenAI Component

Let's build our first GenAI component! This is like creating a data ingestion pipeline, but for AI.

### 🔍 **What This Component Does:**
1. **Connects to GCS** (like connecting to BigQuery)
2. **Lists available documents** (like scanning table partitions)
3. **Downloads files to local temp storage** (like caching data)
4. **Parses different formats** (PDF, DOCX, TXT - like handling CSV, JSON, Parquet)
5. **Adds metadata** (source tracking - like data lineage!)
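The same list, filter, parse, and tag-with-metadata flow can be tried without any cloud access. The sketch below is a simplified local-folder stand-in, not the GCS loader itself: it only parses `.txt` files, and the helper names (`list_supported`, `load_text_file`, `ingest`) are hypothetical.

```python
import tempfile
import time
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}

def list_supported(folder):
    """Return files under `folder` whose extension is supported, sorted by path."""
    return sorted(p for p in Path(folder).rglob("*")
                  if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS)

def load_text_file(path):
    """Read one .txt file and wrap it in a dict that carries lineage metadata."""
    return {
        "page_content": path.read_text(encoding="utf-8"),
        "metadata": {"source": path.name, "file_type": path.suffix.lower(),
                     "processed_at": time.time()},
    }

def ingest(folder):
    """Load every file this sketch can parse; report and skip the rest."""
    records = []
    for path in list_supported(folder):
        if path.suffix.lower() != ".txt":
            print(f"Skipping {path.name}: parser not included in this sketch")
            continue
        try:
            records.append(load_text_file(path))
        except (OSError, UnicodeDecodeError) as err:
            print(f"Could not read {path.name}: {err}")
    return records

# Demo on a throwaway folder containing one supported and one unsupported file.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "notes.txt").write_text("Vector search finds meaning.", encoding="utf-8")
    Path(demo_dir, "image.png").write_bytes(b"not a document")
    demo_records = ingest(demo_dir)

print(len(demo_records), demo_records[0]["metadata"]["source"])
```

Catching errors per file, rather than around the whole loop, is what lets one unreadable document fail without stopping the rest of the batch.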

### 🎯 **Learning Objectives:**
- Understand document ingestion patterns in GenAI
- See how LangChain handles different file formats
- Learn about temporary file management
- Practice error handling in GenAI pipelines

### 💭 **Data Engineering Parallels:**
- This is like an ETL extract phase
- File format handling = data schema validation
- Metadata addition = data cataloging
- Error handling = data quality checks


In [8]:
# 📚 Building Our Document Loader Class
print("🏗️ Creating our first GenAI component - Document Loader!")
print("This is like building a data ingestion pipeline for AI...")

class GCSDocumentLoader:
    """
    📚 Document Loader - Your GenAI Data Ingestion Pipeline
    
    This class handles the 'Extract' part of our GenAI ETL process:
    • Connects to GCS (your data lake)
    • Downloads documents (data ingestion)
    • Parses different formats (schema handling)
    • Adds metadata (data lineage)
    """
    
    def __init__(self, bucket_name: str, project_id: str, folder_path: str = ""):
        print(f"🔧 Initializing Document Loader...")
        self.bucket_name = bucket_name
        self.project_id = project_id
        self.folder_path = folder_path
        self.client = None  # Will be initialized when needed
        print(f"   📍 Target: gs://{bucket_name}/{folder_path}")
        
    def _initialize_client(self):
        """🔑 Initialize GCS client - like connecting to your data warehouse"""
        print("🔑 Connecting to Google Cloud Storage...")
        try:
            # 🗝️ Use service account authentication (secure!)
            credentials_path = "genai-sai-auth.json"
            print(f"   📋 Using credentials: {credentials_path}")
            
            from google.oauth2 import service_account
            credentials = service_account.Credentials.from_service_account_file(credentials_path)
            
            # Initialize the GCS client
            self.client = storage.Client(project=self.project_id, credentials=credentials)
            print(f"✅ Connected to GCS project: {self.project_id}")
            print(f"🔐 Authentication successful!")
            
        except Exception as e:
            print(f"❌ Connection failed: {str(e)}")
            print("💡 Check: genai-sai-auth.json exists and has proper permissions")
            raise
    
    def list_documents(self) -> List[str]:
        """📋 Discover available documents - like scanning your data catalog"""
        print("📋 Scanning for documents in GCS bucket...")
        
        if not self.client:
            self._initialize_client()
        
        assert self.client is not None, "Failed to initialize GCS client"
        
        # 🪣 Access the bucket (like connecting to a database)
        bucket = self.client.bucket(self.bucket_name)
        blobs = bucket.list_blobs(prefix=self.folder_path)
        
        # 🔍 Filter for supported document types
        document_files = []
        supported_extensions = ['.pdf', '.docx', '.txt']
        print(f"   🎯 Looking for: {', '.join(supported_extensions)} files")
        
        for blob in blobs:
            if any(blob.name.lower().endswith(ext) for ext in supported_extensions):
                document_files.append(blob.name)
                print(f"   📄 Found: {blob.name}")
        
        print(f"✅ Discovery complete: {len(document_files)} documents found")
        return document_files
    
    def download_and_load_documents(self) -> List:
        """
        📥 Download & Parse Documents - The core ingestion process
        
        This is like your ETL extract + transform phases:
        1. Download from cloud storage
        2. Parse different file formats  
        3. Extract text content
        4. Add metadata for tracking
        """
        print("📥 Starting document ingestion pipeline...")
        
        if not self.client:
            self._initialize_client()
        
        assert self.client is not None, "Failed to initialize GCS client"
        
        # Initialize containers
        documents = []
        bucket = self.client.bucket(self.bucket_name)
        document_files = self.list_documents()
        
        if not document_files:
            print("⚠️ No documents found to process!")
            return documents
            
        print(f"🚀 Processing {len(document_files)} documents...")
        
        # Process each document
        for i, file_name in enumerate(document_files, 1):
            print(f"\n📄 Processing {i}/{len(document_files)}: {file_name}")
            
            temp_path = None  # Filled in once the temp file exists, so cleanup can find it
            try:
                # 💾 Create temporary file (like staging area in ETL)
                file_suffix = Path(file_name).suffix
                print(f"   📁 Creating temp file with suffix: {file_suffix}")
                
                # Close the handle immediately; only the path is needed for download and parsing
                with tempfile.NamedTemporaryFile(delete=False, suffix=file_suffix) as temp_file:
                    temp_path = temp_file.name
                
                # ⬇️ Download from GCS
                print(f"   ⬇️ Downloading from GCS...")
                blob = bucket.blob(file_name)
                blob.download_to_filename(temp_path)
                print(f"   ✅ Downloaded to: {temp_path}")
                
                # 🔧 Choose appropriate parser based on file type
                print(f"   🔧 Selecting parser for: {file_suffix}")
                if file_name.lower().endswith('.pdf'):
                    loader = PyPDFLoader(temp_path)
                    print("   📖 Using PDF parser")
                elif file_name.lower().endswith('.docx'):
                    loader = Docx2txtLoader(temp_path)
                    print("   📝 Using DOCX parser")
                elif file_name.lower().endswith('.txt'):
                    loader = TextLoader(temp_path)
                    print("   📄 Using TXT parser")
                else:
                    print(f"   ⚠️ Unsupported file type: {file_suffix}")
                    continue  # The finally block below still removes the temp file
                
                # 📖 Parse the document content
                print("   📖 Parsing document content...")
                doc_content = loader.load()
                print(f"   📊 Extracted {len(doc_content)} pages/sections")
                
                # 🏷️ Add metadata (like data lineage!)
                print("   🏷️ Adding metadata...")
                for doc in doc_content:
                    doc.metadata['source'] = file_name  # Original filename
                    doc.metadata['bucket'] = self.bucket_name  # Source bucket
                    doc.metadata['file_type'] = file_suffix  # File format
                    doc.metadata['processed_at'] = str(os.path.getmtime(temp_path))
                
                documents.extend(doc_content)
                print(f"   ✅ Successfully processed: {file_name}")
                
            except Exception as e:
                print(f"   ❌ Error processing {file_name}: {str(e)}")
                print(f"   🔄 Continuing with next file...")
            finally:
                # 🧹 Clean up temporary file (good housekeeping!), even after an error or a skip
                if temp_path and os.path.exists(temp_path):
                    os.unlink(temp_path)
                    print(f"   🧹 Cleaned up temp file")
        
        print(f"\n🎉 Ingestion Complete!")
        print(f"   📊 Total documents processed: {len(document_files)}")
        print(f"   📄 Total text chunks extracted: {len(documents)}")
        print(f"   🎯 Ready for the next pipeline stage!")
        
        return documents

print("✅ GCS Document Loader class ready!")
print("💡 This is your 'Extract' component in the GenAI ETL pipeline")
print("🎯 Next: We'll build the 'Transform' and 'Load' components!")


🏗️ Creating our first GenAI component - Document Loader!
This is like building a data ingestion pipeline for AI...
✅ GCS Document Loader class ready!
💡 This is your 'Extract' component in the GenAI ETL pipeline
🎯 Next: We'll build the 'Transform' and 'Load' components!


In [11]:
### TEST 
# config = Config()
# gcsloader = GCSDocumentLoader(bucket_name=config.GCS_BUCKET_NAME, project_id=config.GOOGLE_CLOUD_PROJECT, folder_path=config.GCS_FOLDER_PATH)
# documents = gcsloader.download_and_load_documents()
# print(f"✅ Documents loaded: {len(documents)}")


## 🧮 Step 4: The Magic of Embeddings & Vector Search - The Heart of GenAI

Now we're getting to the exciting part! Let's understand how AI actually "understands" your documents.

### 🧠 **Core Concepts to Learn:**

#### 📏 **Text Embeddings (Text → Numbers):**
- Convert words to vectors: "Hello" → [0.1, 0.8, -0.3, ...]
- Texts with similar meaning get vectors that point in similar directions
- Like GPS coordinates, but for meaning!
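"Similar vectors" is usually measured with cosine similarity, the cosine of the angle between two vectors. Here is a small sketch using hand-made 3-dimensional vectors; the numbers in `toy_vectors` are invented for illustration, whereas a real embedding model produces hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so report "no similarity"
    return dot / (norm_a * norm_b)

# Invented toy "embeddings": the first two are deliberately close to each other.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "invoice": [0.1, 0.2, 0.95],
}

print("cat vs kitten :", round(cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"]), 3))
print("cat vs invoice:", round(cosine_similarity(toy_vectors["cat"], toy_vectors["invoice"]), 3))
```

The first pair scores much higher than the second, which is the property semantic search relies on.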

#### 🔍 **Vector Database:**
- Stores these number representations
- Finds "similar" vectors super fast
- Like a search index, but for meaning instead of exact words
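At its simplest, a vector database is "store vectors, then return the closest ones". The sketch below is a brute-force, in-memory stand-in that scans every stored vector; `TinyVectorStore` is a hypothetical name, and production vector databases typically add approximate nearest-neighbour indexes so they can avoid the full scan.

```python
import heapq
import math

class TinyVectorStore:
    """In-memory store: keeps (vector, payload) pairs and scans all of them on search."""

    def __init__(self):
        self._rows = []

    def add(self, vector, payload):
        self._rows.append((list(vector), payload))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0

    def search(self, query_vector, k=2):
        """Return the k (score, payload) pairs with the highest cosine similarity."""
        scored = ((self._cosine(query_vector, vec), payload) for vec, payload in self._rows)
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])

store = TinyVectorStore()
store.add([1.0, 0.0], "chunk about billing")
store.add([0.0, 1.0], "chunk about hiking")
store.add([0.9, 0.1], "chunk about invoices")

for score, payload in store.search([1.0, 0.05], k=2):
    print(f"{score:.3f}  {payload}")
```

The `k` argument plays the same role as the "Similarity Search K" setting: it caps how many neighbours come back.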

#### ✂️ **Document Chunking:**
- Split long documents into smaller pieces
- Balance: big chunks = more context per match, small chunks = more precise retrieval
- Like pagination in databases
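Chunk size and overlap are easiest to see on a tiny string. The sketch below is a plain fixed-width sliding window; it is a simplification and not how LangChain's `RecursiveCharacterTextSplitter` works, since that splitter also tries to break on separators such as paragraphs and sentences. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

alphabet = "abcdefghijklmnopqrstuvwxyz"
for piece in chunk_text(alphabet, chunk_size=10, chunk_overlap=3):
    print(piece)
```

Each printed piece starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.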

### 🎯 **The RAG (Retrieval-Augmented Generation) Pattern:**
1. **Split** documents into chunks
2. **Convert** chunks to embeddings (vectors)
3. **Store** in vector database
4. **Search** for relevant chunks when user asks question
5. **Generate** answer using retrieved context + LLM
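Before wiring up real models, the five steps can be traced end to end with toy parts. In the sketch below, word counts stand in for a neural embedding model, a Python list stands in for the vector database, and the generation step stops at building the prompt instead of calling an LLM. All function names are hypothetical and the three chunks are made-up sample text.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks):
    """Steps 2 and 3: embed every chunk and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, question, k=2):
    """Step 4: rank chunks by similarity to the question and keep the top k."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Step 5 (first half): place the retrieved context in front of the question."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Step 1 is assumed done: these are already-split sample chunks.
chunks = [
    "Chroma stores embeddings on local disk.",
    "Gemini generates the final answer from retrieved context.",
    "Cloud Storage buckets hold the source documents.",
]
index = build_index(chunks)
question = "Where are the source documents stored?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

Word counts only match exact words, so this toy misses synonyms; that gap is exactly what learned embeddings are meant to close.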

Let's build this step by step!


In [12]:
# 🧮 Building the Semantic Search Engine - The Brain of Our System
print("🧠 Building the Semantic Search Engine - This is where the magic happens!")
print("We're creating the 'Transform' and 'Load' parts of our GenAI ETL pipeline...")

class SemanticSearchEngine:
    """
    🧮 Semantic Search Engine - The GenAI Processing Brain
    
    This class handles the core GenAI transformations:
    • Text chunking (data preprocessing)
    • Embeddings generation (feature engineering) 
    • Vector storage (specialized database)
    • Similarity search (intelligent querying)
    • Question answering (the final output)
    """
    
    def __init__(self, config):
        print("🔧 Initializing Semantic Search Engine...")
        self.config = config
        
        # Initialize all components as None (lazy loading)
        self.embeddings = None      # Text → Vector converter
        self.vector_store = None    # Vector database
        self.retriever = None       # Search interface
        self.model = None          # LLM (Gemini 2.5 Pro)
        self.qa_chain = None       # Complete QA pipeline
        
        print("   📊 Engine initialized with lazy loading pattern")
        print("   🎯 Components will be created when needed")
        
    def initialize_embeddings(self):
        """
        🧮 Initialize Embedding Model - Converting Text to Vectors
        
        This is like feature engineering in ML:
        • Converts text to numerical vectors
        • Captures semantic meaning
        • Enables similarity calculations
        """
        print("🧮 Setting up embedding model...")
        print("   📐 This converts text to vectors (numbers that capture meaning)")
        
        self.embeddings = HuggingFaceEmbeddings(
            model_name=self.config.EMBEDDING_MODEL,
            model_kwargs={'device': 'cpu'}  # Use CPU for compatibility
        )
        
        print(f"✅ Embedding model ready: {self.config.EMBEDDING_MODEL}")
        print("   💡 Now we can convert any text to vectors!")
        print("   📊 Example: 'Hello' → [0.1, 0.8, -0.3, ...384 numbers]")
    
    def process_documents(self, documents):
        """
        ✂️ Document Chunking - Breaking Text into Digestible Pieces
        
        This is like data preprocessing:
        • Split long documents into smaller chunks
        • Maintain overlap for context continuity
        • Optimize for both accuracy and performance
        """
        print("✂️ Starting document chunking process...")
        print("   📏 Breaking documents into optimal-sized pieces")
        
        # Create the text splitter with intelligent chunking
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.config.CHUNK_SIZE,        # Max characters per chunk
            chunk_overlap=self.config.CHUNK_OVERLAP,  # Overlap between chunks
            length_function=len,                      # How to measure length
        )
        
        print(f"   ⚙️ Chunking parameters:")
        print(f"      📏 Chunk size: {self.config.CHUNK_SIZE} characters") 
        print(f"      🔄 Overlap: {self.config.CHUNK_OVERLAP} characters")
        print(f"   🔄 Processing {len(documents)} documents...")
        
        # Split the documents
        split_documents = text_splitter.split_documents(documents)
        
        # Show some statistics (guard against an empty result to avoid dividing by zero)
        avg_chunk_size = (
            sum(len(doc.page_content) for doc in split_documents) / len(split_documents)
            if split_documents else 0
        )
        
        print(f"✅ Chunking complete!")
        print(f"   📊 Input: {len(documents)} documents")
        print(f"   📄 Output: {len(split_documents)} chunks")
        print(f"   📏 Average chunk size: {avg_chunk_size:.0f} characters")
        print(f"   🎯 Ready for embedding generation!")
        
        return split_documents
    
    def create_vector_store(self, documents):
        """
        🗄️ Create Vector Database - The Heart of Semantic Search
        
        This is the 'Load' phase of our GenAI ETL:
        • Convert text chunks to embeddings
        • Store in specialized vector database
        • Create search interface for similarity queries
        """
        print("🗄️ Creating vector database...")
        print("   🧮 This will convert all text to vectors and store them")
        
        # Initialize embeddings if not already done
        if not self.embeddings:
            print("   🔧 Embedding model not initialized, setting it up...")
            self.initialize_embeddings()
            
        print("   📊 Converting documents to vectors...")
        print("   ⏳ This might take a moment - we're doing math on every piece of text!")
        
        # Create ChromaDB vector store
        self.vector_store = Chroma.from_documents(
            documents=documents,
            embedding=self.embeddings,
            collection_name=self.config.VECTOR_STORE_COLLECTION_NAME,
            persist_directory=self.config.VECTOR_STORE_PERSIST_DIRECTORY
        )
        
        print("   🔍 Setting up similarity search interface...")
        
        # Create retriever for similarity search
        self.retriever = self.vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": self.config.SIMILARITY_SEARCH_K}
        )
        
        print(f"✅ Vector store created successfully!")
        print(f"   📊 Stored {len(documents)} document chunks as vectors")
        print(f"   🔍 Search configured to return top {self.config.SIMILARITY_SEARCH_K} matches")
        print(f"   💾 Persisted to: {self.config.VECTOR_STORE_PERSIST_DIRECTORY}")
        print(f"   🎯 Ready for semantic search!")
        
        return self.vector_store
    
    def initialize_model(self):
        """
        🤖 Initialize Gemini 2.5 Pro - The Language Understanding Brain
        
        This sets up our Large Language Model:
        • Connects to Google's Gemini 2.5 Pro
        • Configures response style (temperature)
        • Prepares for question answering
        """
        print("🤖 Connecting to Gemini 2.5 Pro...")
        print("   🧠 This is Google's most advanced language model")
        
        self.model = init_chat_model(
            model=self.config.MODEL_NAME,
            model_provider=self.config.MODEL_PROVIDER,
            temperature=self.config.TEMPERATURE
        )
        
        print(f"✅ Gemini 2.5 Pro connected successfully!")
        print(f"   🌡️ Temperature: {self.config.TEMPERATURE} (0=factual, 1=creative)")
        print(f"   🎯 Ready to answer questions!")
        
        return self.model
    
    def create_qa_chain(self):
        """
        🔗 Create Question-Answering Pipeline - Putting It All Together
        
        This combines everything into a complete RAG system:
        • Retrieval: Find relevant documents
        • Augmentation: Add context to the question  
        • Generation: Generate answer using LLM
        """
        print("🔗 Building Question-Answering pipeline...")
        print("   🎯 This combines search + AI to answer questions")
        
        # Initialize model if needed
        if not self.model:
            print("   🤖 LLM not initialized, setting it up...")
            self.initialize_model()
            
        if not self.retriever:
            raise ValueError("❌ Vector store must be created first!")
        
        print("   📝 Creating prompt template...")
        
        # Design the prompt template (how we ask the AI)
        prompt_template = """You are a helpful AI assistant. Use the following pieces of context to answer the question. 
        
If you don't know the answer based on the provided context, say that you don't know. Don't make up information.

Context Information:
{context}

Question: {question}

Helpful Answer:"""
        
        prompt = PromptTemplate(
            template=prompt_template,
            input_variables=["context", "question"]
        )
        
        print("   🔗 Connecting all components...")
        
        # Create the complete RAG chain
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.model,                           # Language model
            chain_type="stuff",                       # How to combine docs
            retriever=self.retriever,                 # Document retriever
            chain_type_kwargs={"prompt": prompt},     # Custom prompt
            return_source_documents=True              # Include source info
        )
        
        print("✅ Question-Answering pipeline ready!")
        print("   🔍 Can now search your documents")
        print("   🤖 Can answer questions using AI") 
        print("   📚 Will show source documents")
        print("   🎉 Your GenAI system is complete!")
        
        return self.qa_chain
    
    def search_documents(self, query: str, k: int = 0):
        """
        🔍 Semantic Document Search - Find Similar Content
        
        This performs similarity search without AI generation:
        • Convert query to vector
        • Find most similar document chunks
        • Return ranked results
        
        Pass k <= 0 (the default) to use config.SIMILARITY_SEARCH_K.
        """
        if not self.vector_store:
            raise ValueError("❌ Vector store must be created first!")
            
        k = k if k > 0 else self.config.SIMILARITY_SEARCH_K
        print(f"🔍 Searching for documents similar to: '{query}'")
        print(f"   📊 Looking for top {k} matches...")
        
        results = self.vector_store.similarity_search(query, k=k)
        
        print(f"✅ Found {len(results)} similar documents")
        return results
    
    def ask_question(self, question: str):
        """
        🤔 Ask Questions About Your Documents - The Complete RAG Experience
        
        This is the full RAG (Retrieval-Augmented Generation) process:
        1. Find relevant documents (Retrieval)
        2. Add them as context (Augmentation)  
        3. Generate answer with AI (Generation)
        """
        if not self.qa_chain:
            raise ValueError("❌ QA chain must be created first!")
        
        print(f"🤔 Processing question: '{question}'")
        print("🔍 Step 1: Searching for relevant information...")
        print("🤖 Step 2: Generating AI response...")
        print("📚 Step 3: Compiling sources...")
        
        # Run the complete RAG pipeline
        result = self.qa_chain.invoke({"query": question})  # .invoke() replaces the deprecated direct call
        
        answer = result["result"]
        source_docs = result["source_documents"]
        
        print(f"\n✅ Question answering complete!")
        print(f"💡 Answer generated using {len(source_docs)} source documents")
        
        return {
            "question": question,
            "answer": answer,
            "source_documents": source_docs
        }

print("✅ Semantic Search Engine class complete!")
print("🎉 You've just built a complete RAG (Retrieval-Augmented Generation) system!")
print("💡 This combines the best of search engines and AI language models")
print("🎯 Next: Let's put it all together and test it!")


🧠 Building the Semantic Search Engine - This is where the magic happens!
We're creating the 'Transform' and 'Load' parts of our GenAI ETL pipeline...
✅ Semantic Search Engine class complete!
🎉 You've just built a complete RAG (Retrieval-Augmented Generation) system!
💡 This combines the best of search engines and AI language models
🎯 Next: Let's put it all together and test it!


## 🚀 Step 5: Putting It All Together - Building Your First GenAI Application!

This is the exciting part! We'll now assemble all our components into a working GenAI system.

### 🎯 **What We're About to Do:**
1. **Load documents** from your GCS bucket (Extract)
2. **Process and chunk** the text (Transform)  
3. **Create vector embeddings** (Feature Engineering)
4. **Build the vector database** (Load)
5. **Initialize the AI model** (Deploy)
6. **Create the QA pipeline** (Production Ready!)

### 💡 **Learning Focus:**
- See how all GenAI components work together
- Understand the complete data flow
- Watch the RAG pipeline in action
- Learn about performance considerations

### 🔗 **The GenAI Pipeline Flow:**
```
Documents (GCS) → Chunking → Embeddings → Vector DB → Retrieval → LLM → Answer
```
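
As a rough illustration of that flow, the sketch below wires the stages together with plain-Python stand-ins. Everything in it is an assumption made for illustration: `toy_embed` counts words instead of calling a real embedding model, the "vector DB" is just a list, and the last step only builds the prompt an LLM would receive. None of these names come from the notebook's classes.

```python
from collections import Counter

def chunk(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count."""
    return Counter(text.lower().split())

def similarity(a: Counter, b: Counter) -> int:
    """Number of shared word occurrences (a crude similarity score)."""
    return sum((a & b).values())

def retrieve(index, question: str, k: int = 1) -> list[str]:
    """Return the k chunk texts that share the most words with the question."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: similarity(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

documents = [
    "Chroma stores vectors for similarity search.",
    "Cloud Storage buckets hold the raw files.",
]

# Documents -> Chunking -> Embeddings -> Vector DB (here: a list of pairs)
index = [(c, toy_embed(c)) for doc in documents for c in chunk(doc, 80)]

# Retrieval -> prompt that would be sent to the LLM
question = "Where are vectors stored?"
context = retrieve(index, question, k=1)
prompt = f"Context: {context[0]}\nQuestion: {question}"
print(prompt)
```

The real system swaps each stand-in for a proper component, but the order of the stages stays the same.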

Ready to see your GenAI system come to life? Let's go!


In [13]:
# 🚀 BUILDING YOUR GENAI SYSTEM - Step by Step!
print("🎉 Welcome to GenAI System Assembly!")
print("Let's build your semantic search engine step by step...")
print("🎯 This is like deploying a complete data pipeline, but for AI!")

print("\n" + "🔧" + " INITIALIZATION " + "🔧")
print("Creating the main engine instance...")
search_engine = SemanticSearchEngine(config)

print("\n" + "="*60)
print("📚 STEP 1: DOCUMENT INGESTION (Extract Phase)")
print("="*60)
print("🎯 Learning Goal: Understand how GenAI systems ingest data")
print("💡 This is like your ETL extract phase - getting data from source")

print(f"\n🪣 Connecting to GCS bucket: {config.GCS_BUCKET_NAME}")
document_loader = GCSDocumentLoader(
    bucket_name=config.GCS_BUCKET_NAME,
    project_id=config.GOOGLE_CLOUD_PROJECT,
    folder_path=config.GCS_FOLDER_PATH
)

print("\n📥 Starting document ingestion...")
print("⏳ This process will:")
print("   1. Connect to your GCS bucket")
print("   2. Scan for supported file types")
print("   3. Download and parse each document")
print("   4. Extract text content")
print("   5. Add metadata for tracking")

documents = document_loader.download_and_load_documents()

print(f"\n📊 INGESTION RESULTS:")
print(f"   ✅ Documents loaded: {len(documents)}")
if documents:
    # Collect the distinct source names of the first few documents
    sources = list(set([doc.metadata.get('source', 'Unknown') for doc in documents[:3]]))
    print(f"   📄 Sample sources: {sources}")
    total_chars = sum(len(doc.page_content) for doc in documents)
    print(f"   📏 Total text characters: {total_chars:,}")
    print(f"   💾 Average document size: {total_chars//len(documents):,} characters")

    # Only report success when something was actually loaded
    print("\n💡 What just happened:")
    print("   • Your documents are now in memory as text objects")
    print("   • Each has metadata (source file, bucket info)")
    print("   • Ready for the next stage: chunking!")
else:
    print("   ⚠️ No documents found - check your bucket configuration!")


🎉 Welcome to GenAI System Assembly!
Let's build your semantic search engine step by step...
🎯 This is like deploying a complete data pipeline, but for AI!

🔧 INITIALIZATION 🔧
Creating the main engine instance...
🔧 Initializing Semantic Search Engine...
   📊 Engine initialized with lazy loading pattern
   🎯 Components will be created when needed

📚 STEP 1: DOCUMENT INGESTION (Extract Phase)
🎯 Learning Goal: Understand how GenAI systems ingest data
💡 This is like your ETL extract phase - getting data from source

🪣 Connecting to GCS bucket: genai-sai-bucket
🔧 Initializing Document Loader...
   📍 Target: gs://genai-sai-bucket/

📥 Starting document ingestion...
⏳ This process will:
   1. Connect to your GCS bucket
   2. Scan for supported file types
   3. Download and parse each document
   4. Extract text content
   5. Add metadata for tracking
📥 Starting document ingestion pipeline...
🔑 Connecting to Google Cloud Storage...
   📋 Using credentials: genai-sai-auth.json
✅ Connected to GCS p

In [14]:
print("\n" + "="*60)
print("✂️ STEP 2: DOCUMENT PROCESSING (Transform Phase)")
print("="*60)
print("🎯 Learning Goal: Understand text chunking and why it matters")
print("💡 This is like data preprocessing - preparing data for optimal AI processing")

if documents:
    print(f"\n📏 About to process {len(documents)} documents")
    print("🔍 Why do we need chunking?")
    print("   • AI models have context limits (like memory constraints)")
    print("   • Smaller chunks = more precise retrieval")
    print("   • Overlap ensures we don't lose context at boundaries")
    
    print(f"\n⚙️ Chunking configuration:")
    print(f"   📏 Chunk size: {config.CHUNK_SIZE} characters")
    print(f"   🔄 Overlap: {config.CHUNK_OVERLAP} characters")
    print(f"   💡 This balances precision vs context!")
    
    # Process documents (split into chunks)
    processed_docs = search_engine.process_documents(documents)
    
    print("\n" + "="*60)
    print("🧮 STEP 3: VECTOR EMBEDDING CREATION (Feature Engineering)")
    print("="*60)
    print("🎯 Learning Goal: Understand how text becomes searchable vectors")
    print("💡 This converts human language to mathematical representations")
    
    print("\n🔮 The embedding magic:")
    print("   📝 'Machine learning' → [0.2, -0.1, 0.8, ...384 numbers]")
    print("   📝 'AI algorithms' → [0.3, -0.2, 0.7, ...384 numbers]")
    print("   🎯 Similar concepts get similar numbers!")
    
    print(f"\n🗄️ Creating vector database with {len(processed_docs)} chunks...")
    print("⏳ This will take a moment - we're doing math on every piece of text!")
    
    # Create vector store
    vector_store = search_engine.create_vector_store(processed_docs)
    
    print(f"\n📊 PROCESSING COMPLETE!")
    print(f"   📚 Original documents: {len(documents)}")
    print(f"   ✂️ Text chunks created: {len(processed_docs)}")
    print(f"   🧮 Vector embeddings: {len(processed_docs)} (one per chunk)")
    print(f"   🗄️ Vector database: ✅ Ready for semantic search!")
    
    print(f"\n💡 What you now have:")
    print(f"   • Every chunk is now a vector produced by {config.EMBEDDING_MODEL.split('/')[-1]}")
    print(f"   • You can find 'similar' content mathematically")
    print(f"   • The AI can search by meaning, not just keywords!")
    
else:
    print("❌ No documents found. Please check your GCS bucket and configuration.")



✂️ STEP 2: DOCUMENT PROCESSING (Transform Phase)
🎯 Learning Goal: Understand text chunking and why it matters
💡 This is like data preprocessing - preparing data for optimal AI processing

📏 About to process 18 documents
🔍 Why do we need chunking?
   • AI models have context limits (like memory constraints)
   • Smaller chunks = more precise retrieval
   • Overlap ensures we don't lose context at boundaries

⚙️ Chunking configuration:
   📏 Chunk size: 1000 characters
   🔄 Overlap: 200 characters
   💡 This balances precision vs context!
✂️ Starting document chunking process...
   📏 Breaking documents into optimal-sized pieces
   ⚙️ Chunking parameters:
      📏 Chunk size: 1000 characters
      🔄 Overlap: 200 characters
   🔄 Processing 18 documents...
✅ Chunking complete!
   📊 Input: 18 documents
   📄 Output: 81 chunks
   📏 Average chunk size: 879 characters
   🎯 Ready for embedding generation!

🧮 STEP 3: VECTOR EMBEDDING CREATION (Feature Engineering)
🎯 Learning Goal: Understand how tex
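
The chunking step reported above (18 documents split into 81 chunks with a 1000-character size and 200-character overlap) can be sketched as a sliding window. This is a simplified illustration, not the splitter the notebook uses: `split_with_overlap` is a hypothetical helper that cuts at fixed character positions. The notebook's own splitter may instead prefer paragraph or sentence boundaries, which would explain an average chunk size below the configured maximum.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Slide a fixed-size window over the text, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

# 2,500 characters of dummy text
text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text)

print([len(c) for c in chunks])           # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 characters
```

The shared 200 characters are what keeps a sentence that straddles a boundary retrievable from at least one chunk.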

In [15]:
print("\n" + "="*60)
print("🤖 STEP 4: AI MODEL INTEGRATION (The Brain)")
print("="*60)
print("🎯 Learning Goal: Understand how LLMs integrate with search")
print("💡 This connects Google's most advanced AI to your data")

if search_engine.vector_store:
    print(f"\n🧠 Connecting to {config.MODEL_NAME}...")
    print("🌟 What makes Gemini 2.5 Pro special:")
    print("   • Understands context and nuance")
    print("   • Can reason across multiple documents")
    print("   • Generates human-like responses")
    print("   • Knows when it doesn't know something")
    
    print(f"\n🔗 Building the complete RAG pipeline...")
    print("📋 RAG = Retrieval-Augmented Generation")
    print("   1. 🔍 Retrieve: Find relevant documents")
    print("   2. 📝 Augment: Add context to the question")
    print("   3. 🤖 Generate: Create AI-powered answer")
    
    # Initialize the model and create QA chain
    qa_chain = search_engine.create_qa_chain()
    
    print(f"\n🎉 YOUR GENAI SYSTEM IS READY!")
    print("="*40)
    print(f"   🤖 AI Model: {config.MODEL_NAME} ✅")
    print(f"   📚 Document Store: {len(processed_docs)} chunks ✅") 
    print(f"   🧮 Vector Database: ChromaDB ✅")
    print(f"   🔗 QA Pipeline: Complete RAG system ✅")
    print(f"   🎯 Ready to answer questions! 🚀")
    
    print(f"\n💡 What you've built:")
    print(f"   • A complete GenAI application from scratch")
    print(f"   • Enterprise-grade document search system")
    print(f"   • AI that can understand and answer questions about YOUR data")
    print(f"   • Production-ready RAG (Retrieval-Augmented Generation) pipeline")
    
    print(f"\n🚀 Next: Let's test your creation!")
    
else:
    print("❌ Vector store not available. Cannot create QA chain.")
    print("💡 Make sure Steps 2 and 3 (chunking and vector store creation) completed successfully.")



🤖 STEP 4: AI MODEL INTEGRATION (The Brain)
🎯 Learning Goal: Understand how LLMs integrate with search
💡 This connects Google's most advanced AI to your data

🧠 Connecting to gemini-2.5-pro...
🌟 What makes Gemini 2.5 Pro special:
   • Understands context and nuance
   • Can reason across multiple documents
   • Generates human-like responses
   • Knows when it doesn't know something

🔗 Building the complete RAG pipeline...
📋 RAG = Retrieval-Augmented Generation
   1. 🔍 Retrieve: Find relevant documents
   2. 📝 Augment: Add context to the question
   3. 🤖 Generate: Create AI-powered answer
🔗 Building Question-Answering pipeline...
   🎯 This combines search + AI to answer questions
   🤖 LLM not initialized, setting it up...
🤖 Connecting to Gemini 2.5 Pro...
   🧠 This is Google's most advanced language model


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


✅ Gemini 2.5 Pro connected successfully!
   🌡️ Temperature: 0.0 (0=factual, 1=creative)
   🎯 Ready to answer questions!
   📝 Creating prompt template...
   🔗 Connecting all components...
✅ Question-Answering pipeline ready!
   🔍 Can now search your documents
   🤖 Can answer questions using AI
   📚 Will show source documents
   🎉 Your GenAI system is complete!

🎉 YOUR GENAI SYSTEM IS READY!
   🤖 AI Model: gemini-2.5-pro ✅
   📚 Document Store: 81 chunks ✅
   🧮 Vector Database: ChromaDB ✅
   🔗 QA Pipeline: Complete RAG system ✅
   🎯 Ready to answer questions! 🚀

💡 What you've built:
   • A complete GenAI application from scratch
   • Enterprise-grade document search system
   • AI that can understand and answer questions about YOUR data
   • Production-ready RAG (Retrieval-Augmented Generation) pipeline

🚀 Next: Let's test your creation!
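
The "Augment" step in the RAG list above boils down to string assembly: the retrieved chunks are pasted into the prompt ahead of the question. The sketch below shows one way this could look. The template wording and the `build_rag_prompt` helper are hypothetical; the notebook's actual prompt template is not shown in this section.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Augment the question with numbered context passages."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Made-up retrieved chunks, standing in for the retriever's output
retrieved = ["Chunk about topic A.", "Chunk about topic B."]
prompt = build_rag_prompt("What is topic A?", retrieved)
print(prompt)
```

Numbering the passages is a design choice that makes it easier to trace an answer back to its sources.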


## 🧪 Step 6: Testing Your GenAI System - See the Magic in Action!

Time to test your creation! This is where you'll see the power of GenAI firsthand.

### 🎯 **What You'll Learn by Testing:**
- How semantic search finds relevant content
- How AI generates answers from your documents  
- The importance of source attribution
- How to evaluate GenAI system performance

### 🔍 **Types of Questions to Try:**
1. **Factual**: "What is...?" "When did...?" "How many...?"
2. **Analytical**: "Why does...?" "What are the implications...?"
3. **Comparative**: "How does X compare to Y?" 
4. **Summary**: "Summarize the main points about..."
5. **Exploratory**: "What does the document say about...?"

### 💡 **Understanding the Results:**
- **Answer**: AI-generated response based on your documents
- **Sources**: Which document chunks were used (like data lineage!)
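- **Relevance**: How well the retrieved chunks match your question. `ask_question` returns the answer and its sources but no numeric score, so judge relevance from the source previews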
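
How well a chunk matches a question is usually measured as the similarity between their embedding vectors. The sketch below computes cosine similarity by hand. The first two vectors reuse the illustrative numbers printed earlier in this notebook (cut down to three dimensions); the third vector is made up to represent an unrelated topic. Real embeddings have hundreds of dimensions, and the vector store may use a different metric.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

machine_learning = [0.2, -0.1, 0.8]   # illustrative values from the earlier cell
ai_algorithms = [0.3, -0.2, 0.7]      # illustrative values from the earlier cell
cooking_recipes = [-0.7, 0.6, 0.1]    # made-up vector for an unrelated topic

print(f"ML vs AI:      {cosine_similarity(machine_learning, ai_algorithms):.2f}")
print(f"ML vs cooking: {cosine_similarity(machine_learning, cooking_recipes):.2f}")
```

The related pair scores close to 1, while the unrelated pair scores below 0.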

Let's start with an automated test to see your system in action!


In [16]:
# Test the semantic search engine with sample questions
def test_search_engine():
    """Test function to demonstrate the search engine capabilities."""
    
    if not search_engine.qa_chain:
        print("❌ QA chain not available. Please run the setup cells first.")
        return
    
    # Sample questions - modify these based on your documents
    sample_questions = [
        "What is the main topic discussed in the documents?",
        "Can you summarize the key points?",
        "What are the main conclusions or findings?",
        "Are there any specific recommendations mentioned?",
        "What methodology or approach is described?"
    ]
    
    print("🧪 Testing Semantic Search Engine")
    print("="*50)
    
    # Test with the first sample question
    question = sample_questions[0]
    print(f"\n📝 Sample Question: {question}")
    print("-" * 50)
    
    try:
        result = search_engine.ask_question(question)
        
        print(f"\n💡 Answer:")
        print(f"{result['answer']}")
        
        print(f"\n📚 Source Documents:")
        for i, doc in enumerate(result['source_documents'], 1):
            print(f"   {i}. {doc.metadata.get('source', 'Unknown')} (Preview: {doc.page_content[:100]}...)")
            
    except Exception as e:
        print(f"❌ Error during question answering: {str(e)}")
    
    print(f"\n💡 Try other questions:")
    for i, q in enumerate(sample_questions[1:], 2):
        print(f"   {i}. {q}")

# Run the test
test_search_engine()


🧪 Testing Semantic Search Engine

📝 Sample Question: What is the main topic discussed in the documents?
--------------------------------------------------
🤔 Processing question: 'What is the main topic discussed in the documents?'
🔍 Step 1: Searching for relevant information...
🤖 Step 2: Generating AI response...
📚 Step 3: Compiling sources...





✅ Question answering complete!
💡 Answer generated using 4 source documents

💡 Answer:
Based on the provided context, the main topic discussed in the documents is the company **Metro AG**.

The text is a list of sources, the majority of which are annual reports ("Geschäftsberichte") from Metro AG for various years. Other sources mentioned relate to Metro AG's share ownership, corporate structure, and legal cases.

📚 Source Documents:
   1. Metro AG – Wikipedia.pdf (Preview: Geschäftsführung, 11. April 2025, abgerufen am 17. April 2025.
100. Reuters: Investor Kretinsky kont...)
   2. Metro AG – Wikipedia.pdf (Preview: 103. Geschäftsbericht 2018/19. (https://berichte.metroag.de/geschaeftsbericht/2018-2019/servicesei
t...)
   3. Metro AG – Wikipedia.pdf (Preview: 36. Die Geschäftsbereiche der Metro-Gruppe (https://web.archive.org/web/20080330120141/http://
www.n...)
   4. Metro AG – Wikipedia.pdf (Preview: 107. Geschäftsbericht 2022/23. (https://berichte.metroag.de/geschaeftsbericht/2022-

## 💬 Step 7: Interactive Question Answering

Use this cell to ask your own questions about the documents:


In [17]:
# Interactive question answering
def ask_custom_question(question: str):
    """Ask a custom question to your documents."""
    
    if not search_engine.qa_chain:
        print("❌ QA chain not available. Please run the setup cells first.")
        return
    
    print(f"🤔 Your Question: {question}")
    print("="*50)
    
    try:
        result = search_engine.ask_question(question)
        
        print(f"\n💡 Answer:")
        print(f"{result['answer']}")
        
        print(f"\n📚 Sources:")
        for i, doc in enumerate(result['source_documents'], 1):
            source = doc.metadata.get('source', 'Unknown')
            preview = doc.page_content[:150].replace('\n', ' ')
            print(f"   {i}. {source}")
            print(f"      Preview: {preview}...")
            
        return result
        
    except Exception as e:
        print(f"❌ Error: {str(e)}")
        return None

# Example usage:
# result = ask_custom_question("Your question here")

print("✅ Interactive question function ready!")
print("Usage: ask_custom_question('Your question about the documents')")


✅ Interactive question function ready!
Usage: ask_custom_question('Your question about the documents')


## 🔍 Step 8: Search Similar Documents

You can also retrieve the document chunks most similar to a query without generating an answer:


In [18]:
# Search for similar documents
def search_similar_documents(query: str, k: int = 3):
    """Search for documents similar to a query."""
    
    if not search_engine.vector_store:
        print("❌ Vector store not available. Please run the setup cells first.")
        return
    
    print(f"🔍 Searching for: '{query}'")
    print("="*50)
    
    try:
        results = search_engine.search_documents(query, k=k)
        
        print(f"\n📄 Found {len(results)} similar documents:")
        
        for i, doc in enumerate(results, 1):
            source = doc.metadata.get('source', 'Unknown')
            content_preview = doc.page_content[:200].replace('\n', ' ')
            
            print(f"\n{i}. Source: {source}")
            print(f"   Content Preview: {content_preview}...")
            print(f"   Length: {len(doc.page_content)} characters")
            
        return results
        
    except Exception as e:
        print(f"❌ Error during search: {str(e)}")
        return None

# Example usage:
# results = search_similar_documents("your search query", k=3)

print("✅ Document search function ready!")
print("Usage: search_similar_documents('your search query', k=3)")


✅ Document search function ready!
Usage: search_similar_documents('your search query', k=3)
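
Under the hood, a similarity search with parameter `k` ranks every stored vector by its distance to the query vector and keeps the `k` closest. The sketch below shows that idea with two-dimensional toy vectors and Euclidean distance. The `top_k` helper and the data are hypothetical; a real vector database uses indexes to avoid comparing against every vector and may use a different metric.

```python
import heapq
import math

def top_k(query_vec, indexed_chunks, k=3):
    """Return (distance, text) pairs for the k chunks closest to the query vector."""
    scored = ((math.dist(query_vec, vec), text) for text, vec in indexed_chunks)
    return heapq.nsmallest(k, scored)

# Made-up chunks with made-up 2-D "embeddings"
indexed_chunks = [
    ("annual report 2022", [0.9, 0.1]),
    ("share ownership", [0.2, 0.8]),
    ("annual report 2018", [0.8, 0.2]),
    ("legal cases", [0.1, 0.9]),
]

results = top_k([1.0, 0.0], indexed_chunks, k=2)
for distance, text in results:
    print(f"{distance:.2f}  {text}")
```

Here a smaller distance means a closer match, so the two "annual report" chunks are returned first. Raising `k` widens the context at the cost of pulling in less relevant chunks.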


## 🎉 Congratulations!

You've successfully built a semantic search engine with the following features:

### ✅ What You've Built:
- **Document Loading**: Automatically loads PDF, DOCX, and TXT files from a GCS bucket
- **Text Processing**: Splits documents into overlapping, searchable chunks
- **Vector Search**: Uses sentence embeddings stored in ChromaDB for semantic similarity search
- **AI Question Answering**: Powered by Gemini 2.5 Pro for natural language responses
- **Source Attribution**: Shows which documents were used to answer questions

### 🚀 Next Steps:
1. **Add More Documents**: Upload more files to your GCS bucket and re-run the setup
2. **Customize Parameters**: Adjust chunk size, embedding models, or search parameters in the config
3. **Advanced Features**: Add document filtering, different prompt templates, or conversation memory
4. **UI Integration**: Build a web interface using Streamlit or Gradio
5. **Production Setup**: Add error handling, logging, and monitoring

### 🛠️ Configuration Summary:
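- **GCS Bucket**: `config.GCS_BUCKET_NAME` (`genai-sai-bucket` in this run)
- **Model**: `config.MODEL_NAME` (`gemini-2.5-pro` in this run)
- **Embedding Model**: `config.EMBEDDING_MODEL`
- **Chunk Size**: `config.CHUNK_SIZE` (1000 characters in this run, with a `config.CHUNK_OVERLAP` of 200)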
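
When you experiment with these settings, a small validated settings object can catch inconsistent values early. The sketch below is hypothetical: `ChunkingSettings` is not the notebook's `config` object. It only reuses the chunk size and overlap reported in this run as defaults.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkingSettings:
    chunk_size: int = 1000     # characters per chunk (value reported in this run)
    chunk_overlap: int = 200   # characters shared between neighbouring chunks

    def __post_init__(self):
        # Reject combinations that would make chunking loop or lose text
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")

settings = ChunkingSettings(chunk_size=500, chunk_overlap=50)
print(f"Window advances {settings.chunk_size - settings.chunk_overlap} characters per chunk")
```

Freezing the dataclass keeps a run's settings from changing halfway through, which makes results easier to compare.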

### 📝 Usage Examples:
```python
# Ask questions about your documents
result = ask_custom_question("What are the main findings?")

# Search for similar content
docs = search_similar_documents("machine learning applications")
```

Happy searching! 🔍✨
