# Problem 2 - Semantic Search for RAG Systems (30 points)

## Tasks:

1. **(12 points) Document Processing and Vector Store Setup:**
   - Load text documents using appropriate document loaders (e.g., PyPDFLoader for PDFs)
   - Split documents into chunks using `chunk_size` and `chunk_overlap` parameters with `RecursiveCharacterTextSplitter`
   - Generate embeddings for each chunk using a pre-trained embedding model
   - Store the embeddings in a vector database (e.g., Chroma) with proper metadata
   - Demonstrate the setup with sample documents and verify chunk creation

2. **(10 points) Implement Semantic Search with Multiple Methods:**
   - Implement similarity search using `vector_store.similarity_search()` method
   - Implement Maximum Marginal Relevance (MMR) search using `vector_store.max_marginal_relevance_search()` method
   - **MMR Definition:** A retrieval strategy that balances relevance with diversity to reduce redundant results.  
     MMR iteratively selects the candidate document that is most relevant to the query **and** least similar to the documents already selected.  
     At each step, score every remaining candidate with:  
     $$\mathrm{MMR}(\text{doc}) = \lambda \cdot \mathrm{Sim}(\text{doc}, \text{query}) - (1-\lambda) \cdot \max_{s \,\in\, \text{selected}} \mathrm{Sim}(\text{doc}, s)$$
     and pick the highest-scoring one, where $\lambda = 0.5$.
   - Demonstrate both search methods with the same queries and compare results
   - Include search with similarity scores using `similarity_search_with_score()`

3. **(8 points) Evaluate your implementation by:**
   - Testing on a provided dataset of 20 short articles about different topics
   - Comparing retrieval quality between different chunk sizes (500, 1000, 1500 characters)
   - Measuring and reporting search latency for different values of k (1, 5, 10)
   - Analyzing the effect of different search types on result diversity (compare similarity vs MMR results for the same query)
   - Demonstrating scenarios where MMR provides better coverage than similarity search
   - Creating visualizations of embedding similarities using dimensionality reduction
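
The MMR formula from Task 2 can be sketched in plain Python to make the greedy selection loop concrete. This is only an illustrative sketch, not how `max_marginal_relevance_search()` is implemented internally: the helper names `cosine` and `mmr_select` are hypothetical, `Sim` is assumed to be cosine similarity, and the vectors are toy 2-D embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k=3, lam=0.5):
    """Greedily pick up to k document indices using the MMR score.

    lam=1.0 reduces to plain similarity ranking; lower values
    penalise candidates that resemble already-selected documents.
    """
    sim_to_query = [cosine(d, query_vec) for d in doc_vecs]
    selected = []
    candidates = list(range(len(doc_vecs)))

    while candidates and len(selected) < k:

        def mmr_score(i):
            # Highest similarity to anything already selected (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * sim_to_query[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is different.
toy_query = [1.0, 0.0]
toy_docs = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]
print("similarity only:", mmr_select(toy_query, toy_docs, k=2, lam=1.0))
print("MMR (lambda=0.5):", mmr_select(toy_query, toy_docs, k=2, lam=0.5))
```

With the toy data, pure similarity returns the two near-duplicates, while $\lambda = 0.5$ swaps the second one for the dissimilar document. In LangChain the comparable knob is the `lambda_mult` argument of `max_marginal_relevance_search()`, with `fetch_k` controlling how many candidates are considered.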


### Bonus:
- **(+4 points):** Implement a hybrid search combining semantic search with keyword-based BM25 scoring
- **(+3 points):** Create a simple web interface demonstrating your search engine
- **(+3 points):** Implement and compare different embedding models (e.g., OpenAI, Cohere, local models)
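
For the hybrid-search bonus, one possible approach is to min-max normalise the semantic and BM25 scores and take a weighted sum. The sketch below is an assumption about how that could look, not a prescribed solution: `bm25_scores`, `min_max`, and `hybrid_scores` are hypothetical helpers, tokenisation is a naive whitespace split, and the semantic scores are made-up numbers standing in for vector-store similarities (higher means more similar).

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def min_max(values):
    """Rescale values to [0, 1]; all-equal inputs map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(semantic, keyword, alpha=0.5):
    """Weighted sum of normalised semantic and keyword scores."""
    sem, kw = min_max(semantic), min_max(keyword)
    return [alpha * s + (1 - alpha) * k for s, k in zip(sem, kw)]


toy_docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]
toy_semantic = [0.2, 0.9, 0.1]  # placeholder similarities, not real model output
keyword = bm25_scores("cat mat", toy_docs)
print(hybrid_scores(toy_semantic, keyword, alpha=0.5))
```

The weight `alpha` trades off the two signals: 1.0 uses only semantic scores and 0.0 uses only BM25. Note that distance-style scores (lower is better) would need to be inverted before being passed in as `semantic`.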


In [1]:
# Install required packages (run once if needed)
# Fix PyTorch DLL issue on Windows by installing CPU-only version
# This is required because langchain_text_splitters imports transformers which needs torch
%pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Install LangChain and other required packages
%pip install -q "langchain>=0.2.10" "langchain-community>=0.2.10" \
                "langchain-text-splitters>=0.2.0" "chromadb>=0.5.5" \
                "pypdf>=4.2.0" "langchain-google-genai>=1.0.0" \
                "google-generativeai>=0.3.0" "python-dotenv>=1.0.0"

Note: you may need to restart the kernel to use updated packages.


In [None]:
# Import necessary libraries for environment variables
import os
from dotenv import load_dotenv

# Load environment variables (including GOOGLE_API_KEY for Gemini)
load_dotenv()

# Verify API key is loaded
api_key = os.getenv("GOOGLE_API_KEY")
if api_key:
    print("✓ GOOGLE_API_KEY loaded successfully")
else:
    print("⚠ Warning: GOOGLE_API_KEY not found in environment variables")
    print("  Please set GOOGLE_API_KEY in your .env file or environment")

✓ GOOGLE_API_KEY loaded successfully


In [None]:
# --- imports & paths ---
import glob
import os

# Try importing with error handling for Windows DLL issues
try:
    from langchain_community.document_loaders import PyPDFLoader, TextLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import Chroma
    from langchain_google_genai import GoogleGenerativeAIEmbeddings
    print("✓ All imports successful")
except OSError as e:
    if "DLL" in str(e) or "torch" in str(e).lower():
        print("⚠ PyTorch DLL error detected. Attempting to fix...")
        print("  Please restart your kernel after running the installation cell again.")
        print("  Or run this in a new terminal: pip install torch --index-url https://download.pytorch.org/whl/cpu")
        raise ImportError("PyTorch DLL error. Please install the CPU-only torch build and restart the kernel.") from e
    else:
        raise

# Configuration
DATA_DIR = "./data"                  # Directory containing documents (PDF/TXT/MD)
PERSIST_DIR = "./chroma_rag_demo"    # Where Chroma DB will be saved
COLLECTION = "rag_demo"              # Collection name in Chroma
CHUNK_SIZE = 1000                    # Size of each text chunk in characters
CHUNK_OVERLAP = 200                  # Overlap between chunks in characters

print("\n✓ Configuration loaded")
print(f"  - Data directory: {DATA_DIR}")
print(f"  - Chunk size: {CHUNK_SIZE}, Overlap: {CHUNK_OVERLAP}")
print("  - Using Google Gemini embeddings")

OSError: [WinError 1114] A dynamic link library (DLL) initialization routine failed. Error loading "c:\Users\riddh\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\lib\c10.dll" or one of its dependencies.

In [None]:
# Load documents from the data directory
paths = sorted(
    glob.glob(os.path.join(DATA_DIR, "**/*.pdf"), recursive=True)
  + glob.glob(os.path.join(DATA_DIR, "**/*.txt"), recursive=True)
  + glob.glob(os.path.join(DATA_DIR, "**/*.md"),  recursive=True)
)

print(f"Found {len(paths)} files in {DATA_DIR}")
docs = []
for p in paths:
    try:
        if p.lower().endswith(".pdf"):
            loader = PyPDFLoader(p)
        else:
            loader = TextLoader(p, encoding="utf-8")
        loaded_docs = loader.load()
        docs.extend(loaded_docs)
        print(f"Loaded {len(loaded_docs)} document(s) from {os.path.basename(p)}")
    except Exception as e:
        print(f"Error loading {p}: {e}")

print(f"\nTotal: Loaded {len(docs)} documents from {len(paths)} files")
if len(docs) > 0:
    print(f"\nSample document info:")
    print(f"  - First doc length: {len(docs[0].page_content)} characters")
    print(f"  - First doc source: {docs[0].metadata.get('source', 'N/A')}")

In [None]:
# Split documents into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    add_start_index=True,
)

chunks = splitter.split_documents(docs)
print(f"Created {len(chunks)} chunks from {len(docs)} documents")
print(f"Chunk size: {CHUNK_SIZE} characters")
print(f"Chunk overlap: {CHUNK_OVERLAP} characters")

# Display sample chunk information
if len(chunks) > 0:
    print(f"\nSample chunk information:")
    print(f"  - First chunk length: {len(chunks[0].page_content)} characters")
    print(f"  - First chunk preview: {chunks[0].page_content[:150]}...")
    if 'start_index' in chunks[0].metadata:
        print(f"  - Start index: {chunks[0].metadata['start_index']}")
    print(f"  - Source: {chunks[0].metadata.get('source', 'N/A')}")
    
    # Show distribution of chunk sizes
    chunk_lengths = [len(chunk.page_content) for chunk in chunks]
    print(f"\nChunk size statistics:")
    print(f"  - Min: {min(chunk_lengths)} characters")
    print(f"  - Max: {max(chunk_lengths)} characters")
    print(f"  - Average: {sum(chunk_lengths) / len(chunk_lengths):.1f} characters")
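
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window splitter. It is not what `RecursiveCharacterTextSplitter` does (that class prefers to break on separators such as paragraphs and sentences, so real chunks are often shorter than `chunk_size`); the function name `sliding_window_chunks` is hypothetical and only illustrates the stride arithmetic.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap chars.

    Returns a list of (start_index, chunk_text) pairs, mirroring the
    start_index metadata that add_start_index=True records.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# With the configuration used here (1000 / 200), each window advances 800 chars.
for start, chunk in sliding_window_chunks("x" * 2600, 1000, 200):
    print(start, len(chunk))
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of more chunks to embed and store.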

In [None]:
# Initialize Google Gemini embedding model
# Make sure GOOGLE_API_KEY is set in your environment or .env file
try:
    emb = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    print("✓ Google Gemini embedding model initialized successfully!")
    print(f"  - Model: models/embedding-001")
    print(f"  - Provider: Google Generative AI (Gemini)")
    
    # Test the embedding model with a sample text
    test_embedding = emb.embed_query("Test embedding")
    print(f"  - Embedding dimension: {len(test_embedding)}")
    
except Exception as e:
    print(f"✗ Error initializing Gemini embeddings: {e}")
    print("  Please ensure:")
    print("    1. GOOGLE_API_KEY is set in your environment or .env file")
    print("    2. You have installed: pip install langchain-google-genai google-generativeai")
    raise
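
Once chunks are embedded, retrieval comes down to comparing vectors. The sketch below shows cosine similarity and squared L2 distance on toy vectors, and checks the identity that links them for unit-length vectors. The helper names are hypothetical. As far as I know, Chroma's default space is squared L2 and `similarity_search_with_score()` returns a distance where lower means more similar, but that is worth confirming against the installed versions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance (0.0 means identical vectors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy stand-ins for embedding vectors (real ones have hundreds of dimensions).
query = normalize([1.0, 0.0])
doc = normalize([3.0, 4.0])

cos = cosine_similarity(query, doc)
dist = squared_l2(query, doc)

# For unit vectors: squared L2 distance = 2 - 2 * cosine similarity,
# so ranking by ascending distance matches ranking by descending cosine.
print(f"cosine={cos:.3f}  squared_l2={dist:.3f}  2-2cos={2 - 2 * cos:.3f}")
```

This matters when comparing results later: a "score" that is a distance must be read in the opposite direction from a similarity.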

In [None]:
# Create vector store with embeddings
print(f"Creating vector store with {len(chunks)} chunks...")
print(f"Persist directory: {PERSIST_DIR}")
print(f"Collection name: {COLLECTION}")

if len(chunks) == 0:
    raise ValueError("Cannot create vector store: no chunks available. Please load and split documents first.")

vs = Chroma.from_documents(
    documents=chunks,
    embedding=emb,
    persist_directory=PERSIST_DIR,
    collection_name=COLLECTION,
)
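# Chroma >= 0.4 writes to persist_directory automatically, so no explicit
# vs.persist() call is needed (the method is deprecated in langchain-community).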

print(f"\n✓ Vector store created and persisted successfully!")
print(f"  - Persisted at: {os.path.abspath(PERSIST_DIR)}")
print(f"  - Vector count: {vs._collection.count()}")
print(f"  - Collection: {COLLECTION}")

In [None]:
# Verify vector store contents and metadata
print("Verifying vector store setup...")
print(f"Total vectors in collection: {vs._collection.count()}")

# Get sample documents with metadata
sample_results = vs._collection.get(include=["metadatas", "documents"], limit=3)
sample_metadatas = sample_results.get("metadatas", [])
sample_documents = sample_results.get("documents", [])

print(f"\nSample metadata from {len(sample_metadatas)} documents:")
for i, (meta, doc) in enumerate(zip(sample_metadatas, sample_documents), 1):
    print(f"\n  Document {i}:")
    print(f"    Source: {meta.get('source', 'N/A')}")
    if 'start_index' in meta:
        print(f"    Start index: {meta['start_index']}")
    print(f"    Content preview: {doc[:100]}...")

print("\n✓ Vector store verification complete!")
print("✓ Task 1: Document Processing and Vector Store Setup - COMPLETE")