# Reranking for Enhanced RAG Systems

This notebook implements reranking techniques to improve retrieval quality in RAG systems. Reranking acts as a second filtering step after initial retrieval to ensure the most relevant content is used for response generation.

## Key Concepts of Reranking

1. **Initial Retrieval**: First pass using basic similarity search (less accurate but faster)
2. **Document Scoring**: Evaluating each retrieved document's relevance to the query
3. **Reordering**: Sorting documents by their relevance scores
4. **Selection**: Using only the most relevant documents for response generation
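
The four stages above can be sketched end to end on toy data. This is only an illustration under stated assumptions, not the notebook's implementation: `retrieve_then_rerank`, `shared_words`, `toy_similarity`, and `toy_relevance` are hypothetical names, and simple word overlap stands in for both the embedding search and the reranker scoring that are defined later.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(query: str, documents: List[str],
                         first_pass: Scorer, second_pass: Scorer,
                         k: int = 3, top_n: int = 2) -> List[Tuple[str, float]]:
    # 1. Initial retrieval: keep the k best documents under the cheap score.
    candidates = sorted(documents, key=lambda d: first_pass(query, d), reverse=True)[:k]
    # 2. Document scoring: apply the more careful score to the candidates only.
    scored = [(doc, second_pass(query, doc)) for doc in candidates]
    # 3. Reordering: sort by the second-pass score, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # 4. Selection: keep only the top_n documents for response generation.
    return scored[:top_n]

def shared_words(query: str, doc: str, min_len: int = 1) -> float:
    """Count query words of at least min_len characters that also occur in doc."""
    query_words = {w for w in query.lower().split() if len(w) >= min_len}
    return float(len(query_words & set(doc.lower().split())))

def toy_similarity(query: str, doc: str) -> float:
    # Hypothetical first pass: every shared word counts, including "the" and "of".
    return shared_words(query, doc)

def toy_relevance(query: str, doc: str) -> float:
    # Hypothetical second pass: only longer, more content-bearing words count.
    return shared_words(query, doc, min_len=5)

documents = [
    "the cost of the housing in the city",
    "economic growth in the region",
    "job loss is an economic cause of homelessness",
]
query = "what is the economic cause of homelessness"

for doc, score in retrieve_then_rerank(query, documents, toy_similarity, toy_relevance):
    print(score, doc)
```

In this toy run the first pass cannot separate the housing and growth sentences (two shared words each), while the second pass ranks the growth sentence higher because it shares a content word with the query.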

## Setting Up the Environment
We begin by importing necessary libraries.

In [2]:
import fitz # PyMuPDF
import os
import numpy as np
import json
import re
import google.generativeai as genai

In [3]:

import fitz
import os
import google.generativeai as genai
from dotenv import load_dotenv


## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [4]:
import fitz
from typing import List, Dict

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file using PyMuPDF (fitz).

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF, or an empty string if an error occurs.
    """
    all_text = ""
    try:
        # Use a context manager to automatically close the document
        with fitz.open(pdf_path) as mypdf:
            # Iterate through each page to extract text
            for page in mypdf:
                all_text += page.get_text("text") + " "
    except Exception as e:
        print(f"Error reading PDF file: {e}")
        return ""
    
    return all_text.strip()

In [5]:


# --- 1. Your PDF text extraction function ---
def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the entire PDF.
    """
    all_text = []
    try:
        # Use a context manager to ensure the file is closed properly
        with fitz.open(pdf_path) as doc:
            # Iterate through each page in the PDF
            for page in doc:
                all_text.append(page.get_text("text"))
    except Exception as e:
        print(f"Error reading PDF file: {e}")
        return ""
        
    return "\n".join(all_text)

# --- 2. Gemini API Configuration (for a complete workflow) ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 3. Main Logic ---
if __name__ == "__main__":
    pdf_file = "/Users/kekunkoya/Desktop/770 Google /Homelessness.pdf"

    # Verify that the PDF file exists before proceeding
    if not os.path.exists(pdf_file):
        print(f"Error: PDF file not found at '{pdf_file}'")
        exit()

    # Step A: Extract text from the PDF
    print("Extracting text from PDF...")
    text = extract_text_from_pdf(pdf_file)
    print("Text extraction complete.")

    # Step B: Print the extracted text
    if text:
        # Print the first 500 characters to verify
        print("\nExtracted Text (first 500 characters):")
        print(text[:500] + "...")
    else:
        print("No text was extracted from the PDF.")

Extracting text from PDF...
Text extraction complete.

Extracted Text (first 500 characters):
19
Defining and Measuring Homelessness
Volker Busch-Geertsema
GISS, Germany
>> Abstract_ Substantial progress has been made at EU level on defining home-
lessness. The European Typology on Homelessness and Housing Exclusion 
(ETHOS) is widely accepted in almost all European countries (and beyond) as 
a useful conceptual framework and almost everywhere definitions at national 
level (though often not identical with ETHOS) are discussed in relation to this 
typology. The development and some of th...


## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [6]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks
    
    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
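
To see what the overlap does, here is a small sketch that uses the same stride logic (`n - overlap`) on a ten-character string. `chunk_text_checked` is a hypothetical variant, not part of the notebook; the argument check is an addition, since the function above raises an unhelpful `range()` error when `overlap == n` and silently returns an empty list when `overlap > n`.

```python
from typing import List

def chunk_text_checked(text: str, n: int, overlap: int) -> List[str]:
    # Hypothetical guarded variant of chunk_text: same stride, plus validation.
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("require n > 0 and 0 <= overlap < n")
    step = n - overlap  # each chunk starts `step` characters after the previous one
    return [text[i:i + n] for i in range(0, len(text), step)]

chunks = chunk_text_checked("abcdefghij", n=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk repeats the last two characters of the previous one. The final chunk can be shorter than `n` and, as here, may be fully contained in the chunk before it.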

## Building a Simple Vector Store
To demonstrate how reranking integrates with retrieval, let's implement a simple vector store.

In [7]:
class SimpleVectorStore:
    """
    A simple vector store implementation using NumPy.
    """
    def __init__(self):
        """
        Initialize the vector store.
        """
        self.vectors = []  # List to store embedding vectors
        self.texts = []  # List to store original texts
        self.metadata = []  # List to store metadata for each text
    
    def add_item(self, text, embedding, metadata=None):
        """
        Add an item to the vector store.

        Args:
        text (str): The original text.
        embedding (List[float]): The embedding vector.
        metadata (dict, optional): Additional metadata.
        """
        self.vectors.append(np.array(embedding))  # Convert embedding to numpy array and add to vectors list
        self.texts.append(text)  # Add the original text to texts list
        self.metadata.append(metadata or {})  # Add metadata to metadata list, use empty dict if None
    
    def similarity_search(self, query_embedding, k=5):
        """
        Find the most similar items to a query embedding.

        Args:
        query_embedding (List[float]): Query embedding vector.
        k (int): Number of results to return.

        Returns:
        List[Dict]: Top k most similar items with their texts and metadata.
        """
        if not self.vectors:
            return []  # Return empty list if no vectors are stored
        
        # Convert query embedding to numpy array
        query_vector = np.array(query_embedding)
        
        # Calculate similarities using cosine similarity
        similarities = []
        for i, vector in enumerate(self.vectors):
            # Compute cosine similarity between query vector and stored vector
            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
            similarities.append((i, similarity))  # Append index and similarity score
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top k results
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],  # Add the corresponding text
                "metadata": self.metadata[idx],  # Add the corresponding metadata
                "similarity": score  # Add the similarity score
            })
        
        return results  # Return the list of top k similar items
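
The score used in `similarity_search` is cosine similarity, the dot product divided by the product of the two vector norms. The sketch below works through it on made-up 2-D vectors using only the standard library. `cosine_similarity` and `top_k` are hypothetical helpers, and returning 0.0 for a zero-length vector is an assumption added here; the class above would divide by zero in that case.

```python
import math
from typing import List, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Assumption: an all-zero vector is treated as similarity 0.0.
    return dot / norm if norm else 0.0

def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int) -> List[Tuple[int, float]]:
    # Score every stored vector, then keep the k highest scores.
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0], [0.0, 0.0]]
print(top_k([1.0, 0.0], vectors, k=2))  # [(0, 1.0), (2, 0.6)]
```

The vector `[3.0, 4.0]` scores 0.6 because only its direction matters: its length of 5 is divided out.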

## Creating Embeddings

In [8]:
import os
import google.generativeai as genai
from typing import List, Any, Union
import numpy as np

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the create_embeddings function for Gemini ---
def create_embeddings(text: Union[str, List[str]], model: str = "models/embedding-001") -> Any:
    """
    Creates embeddings for the given text or list of texts using the Gemini API.

    Args:
    text (str or List[str]): The input text(s) for which embeddings are to be created.
    model (str): The model to be used for creating embeddings. Default is "models/embedding-001".

    Returns:
    List[float] or List[List[float]]: A list of embedding vectors.
    """
    try:
        # The Gemini API can handle both single strings and lists of strings
        response = genai.embed_content(
            model=model,
            content=text
        )
        
        # The API returns a dictionary with a single key 'embedding'.
        # Its value is one vector for a string input, or a list of vectors
        # (one per text) for a list input, so the same lookup covers both cases.
        return response['embedding']

    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return []

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Example 1: Create an embedding for a single string
    single_text = "Homelessness is a complex social issue."
    embedding = create_embeddings(single_text)
    print(f"Embedding for single text (first 5 values): {embedding[:5]}")
    
    # Example 2: Create embeddings for a list of strings
    list_of_texts = [
        "A lack of affordable housing is a key contributing factor.",
        "Social factors also play a role in homelessness."
    ]
    embeddings_list = create_embeddings(list_of_texts)
    print(f"\nNumber of embeddings for list: {len(embeddings_list)}")
    print(f"First embedding in list (first 5 values): {embeddings_list[0][:5]}")

Embedding for single text (first 5 values): [0.052571062, -0.03685706, -0.06520665, -0.04034025, 0.038206574]

Number of embeddings for list: 2
First embedding in list (first 5 values): [0.07521696, -0.034325134, -0.039195377, -0.008227663, 0.10222888]
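
As the two outputs show, `create_embeddings` returns a flat list of floats for a single string and a list of such lists for a list of strings. Code that accepts either form can normalize it first. The helper below is a hypothetical sketch, and `as_vector_list` is not part of the notebook.

```python
from typing import List, Sequence, Union

Vector = List[float]

def as_vector_list(embedding: Union[Sequence[float], Sequence[Sequence[float]]]) -> List[Vector]:
    """Return a list of vectors whether the input is one vector or many."""
    if not embedding:
        return []  # create_embeddings returns [] when the API call fails
    if isinstance(embedding[0], (int, float)):
        return [list(embedding)]  # a single flat vector: wrap it
    return [list(vec) for vec in embedding]  # already one vector per text

print(len(as_vector_list([0.05, -0.03, -0.06])))        # 1
print(len(as_vector_list([[0.07, -0.03], [0.1, 0.2]])))  # 2
```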


## Document Processing Pipeline
Now that we have defined the necessary functions and classes, we can proceed to define the document processing pipeline.

In [10]:
def process_document(pdf_path, chunk_size=1000, chunk_overlap=200):
    """
    Process a document for RAG.

    Args:
    pdf_path (str): Path to the PDF file.
    chunk_size (int): Size of each chunk in characters.
    chunk_overlap (int): Overlap between chunks in characters.

    Returns:
    SimpleVectorStore: A vector store containing document chunks and their embeddings.
    """
    # Extract text from the PDF file
    print("Extracting text from PDF...")
    extracted_text = extract_text_from_pdf(pdf_path)
    
    # Chunk the extracted text
    print("Chunking text...")
    chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
    print(f"Created {len(chunks)} text chunks")
    
    # Create embeddings for the text chunks
    print("Creating embeddings for chunks...")
    chunk_embeddings = create_embeddings(chunks)
    
    # Initialize a simple vector store
    store = SimpleVectorStore()
    
    # Add each chunk and its embedding to the vector store
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)):
        store.add_item(
            text=chunk,
            embedding=embedding,
            metadata={"index": i, "source": pdf_path}
        )
    
    print(f"Added {len(chunks)} chunks to the vector store")
    return store

## Implementing LLM-based Reranking
Let's implement the LLM-based reranking function using the Gemini API.

In [11]:
import os
import re
import google.generativeai as genai
from typing import List, Dict, Any

# --- 1. Gemini API Configuration ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the reranking function for Gemini ---
def rerank_with_llm(query: str, results: List[Dict], top_n: int = 3, model: str = "gemini-1.5-flash") -> List[Dict]:
    """
    Reranks search results using LLM relevance scoring.
    
    Args:
        query (str): User query
        results (List[Dict]): Initial search results
        top_n (int): Number of results to return after reranking
        model (str): Model to use for scoring
        
    Returns:
        List[Dict]: Reranked results
    """
    print(f"Reranking {len(results)} documents...")
    
    scored_results = []
    
    # Define the system prompt for the LLM
    system_prompt = """You are an expert at evaluating document relevance for search queries.
Your task is to rate documents on a scale from 0 to 10 based on how well they answer the given query.

Guidelines:
- Score 0-2: Document is completely irrelevant
- Score 3-5: Document has some relevant information but doesn't directly answer the query
- Score 6-8: Document is relevant and partially answers the query
- Score 9-10: Document is highly relevant and directly answers the query

You MUST respond with ONLY a single integer score between 0 and 10. Do not include ANY other text."""
    
    # Create the Gemini model instance once with the system prompt
    try:
        llm_model = genai.GenerativeModel(model, system_instruction=system_prompt)
    except Exception as e:
        print(f"Failed to initialize Gemini model: {e}")
        return []

    for i, result in enumerate(results):
        if i % 5 == 0:
            print(f"Scoring document {i+1}/{len(results)}...")
        
        # Define the user prompt for the LLM
        user_prompt = f"""Query: {query}

Document:
{result['text']}

Rate this document's relevance to the query on a scale from 0 to 10:"""
        
        try:
            # Get the LLM response
            response = llm_model.generate_content(user_prompt)
            
            # Extract the score from the LLM response
            score_text = response.text.strip()
            score_match = re.search(r'\b(10|[0-9])\b', score_text)
            
            if score_match:
                score = float(score_match.group(1))
            else:
                print(f"Warning: Could not extract score from response: '{score_text}', using similarity score instead")
                score = result.get("similarity", 0) * 10
        except Exception as e:
            print(f"Error during scoring API call for document {i+1}: {e}")
            score = result.get("similarity", 0) * 10 # Fallback in case of API error
        
        scored_results.append({
            "text": result["text"],
            "metadata": result["metadata"],
            "similarity": result.get("similarity", 0),
            "relevance_score": score
        })
    
    reranked_results = sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)
    
    return reranked_results[:top_n]

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate initial search results
    initial_results = [
        {"text": "A key factor is the lack of affordable housing.", "metadata": {"id": 1}, "similarity": 0.85},
        {"text": "The sun is the center of our solar system.", "metadata": {"id": 2}, "similarity": 0.70},
        {"text": "Economic factors like job loss contribute to homelessness.", "metadata": {"id": 3}, "similarity": 0.90},
        {"text": "Jupiter is the largest planet.", "metadata": {"id": 4}, "similarity": 0.65},
        {"text": "Government organizations provide shelters and support.", "metadata": {"id": 5}, "similarity": 0.75}
    ]
    query = "What are the economic causes of homelessness?"

    # Rerank the results
    final_results = rerank_with_llm(query, initial_results, top_n=2)

    # Print the reranked results
    print("\nFinal Reranked Results (Top 2):")
    for i, res in enumerate(final_results):
        print(f"Rank {i+1} (Score: {res['relevance_score']}): {res['text']}")

Reranking 5 documents...
Scoring document 1/5...

Final Reranked Results (Top 2):
Rank 1 (Score: 6.0): A key factor is the lack of affordable housing.
Rank 2 (Score: 6.0): Economic factors like job loss contribute to homelessness.
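
The score extraction inside `rerank_with_llm` can be tried in isolation, without calling the API. The sketch below reuses the same regular expression and the same fallback (similarity times 10); `parse_relevance_score` is a hypothetical helper name introduced only for this illustration.

```python
import re

# Same pattern as in rerank_with_llm: a standalone integer from 0 to 10.
SCORE_PATTERN = re.compile(r"\b(10|[0-9])\b")

def parse_relevance_score(response_text: str, similarity: float = 0.0) -> float:
    match = SCORE_PATTERN.search(response_text.strip())
    if match:
        return float(match.group(1))
    # No usable score in the reply: fall back to the scaled similarity.
    return similarity * 10

print(parse_relevance_score("8"))                           # 8.0
print(parse_relevance_score("Score: 10"))                   # 10.0
print(parse_relevance_score("not sure", similarity=0.75))   # 7.5
```

Two details are worth noting. A reply such as `15` contains no standalone 0 to 10 integer, so it also takes the fallback path. And because `sorted` is stable, documents with equal scores keep their first-pass order, which is what the 6.0 tie in the output above shows.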


## Simple Keyword-based Reranking

In [12]:
def rerank_with_keywords(query, results, top_n=3):
    """
    A simple alternative reranking method based on keyword matching and position.
    
    Args:
        query (str): User query
        results (List[Dict]): Initial search results
        top_n (int): Number of results to return after reranking
        
    Returns:
        List[Dict]: Reranked results
    """
    # Extract important keywords from the query
    keywords = [word.lower() for word in query.split() if len(word) > 3]
    
    scored_results = []  # Initialize a list to store scored results
    
    for result in results:
        document_text = result["text"].lower()  # Convert document text to lowercase
        
        # Base score starts with vector similarity
        base_score = result["similarity"] * 0.5
        
        # Initialize keyword score
        keyword_score = 0
        for keyword in keywords:
            if keyword in document_text:
                # Add points for each keyword found
                keyword_score += 0.1
                
                # Add more points if keyword appears near the beginning
                first_position = document_text.find(keyword)
                if first_position < len(document_text) / 4:  # In the first quarter of the text
                    keyword_score += 0.1
                
                # Add points for keyword frequency
                frequency = document_text.count(keyword)
                keyword_score += min(0.05 * frequency, 0.2)  # Cap at 0.2
        
        # Calculate the final score by combining base score and keyword score
        final_score = base_score + keyword_score
        
        # Append the scored result to the list
        scored_results.append({
            "text": result["text"],
            "metadata": result["metadata"],
            "similarity": result["similarity"],
            "relevance_score": final_score
        })
    
    # Sort results by final relevance score in descending order
    reranked_results = sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)
    
    # Return the top_n results
    return reranked_results[:top_n]
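
A worked example makes the scoring rules concrete. The sketch below applies the same rules to a single document; `keyword_relevance` is a hypothetical helper, and the `strip_punctuation` flag is an addition for illustration only. It highlights one property of the function above: `query.split()` keeps punctuation, so a keyword such as `homelessness?` does not match `homelessness.` in the text.

```python
def keyword_relevance(query: str, text: str, similarity: float,
                      strip_punctuation: bool = False) -> float:
    words = query.split()
    if strip_punctuation:
        # Hypothetical variation: drop leading/trailing punctuation from query words.
        words = [w.strip("?.,!;:") for w in words]
    keywords = [w.lower() for w in words if len(w) > 3]
    doc = text.lower()
    score = similarity * 0.5                        # base score from vector similarity
    for kw in keywords:
        if kw in doc:
            score += 0.1                            # keyword present
            if doc.find(kw) < len(doc) / 4:
                score += 0.1                        # first hit in the first quarter
            score += min(0.05 * doc.count(kw), 0.2)  # frequency bonus, capped at 0.2
    return score

query = "What are the economic causes of homelessness?"
text = "Economic factors like job loss contribute to homelessness."

print(round(keyword_relevance(query, text, similarity=0.90), 2))                          # 0.7
print(round(keyword_relevance(query, text, similarity=0.90, strip_punctuation=True), 2))  # 0.85
```

With the original rules only `economic` matches (0.45 base + 0.1 + 0.1 + 0.05). Stripping the question mark lets `homelessness` match as well, adding 0.1 for presence and 0.05 for frequency.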

## Response Generation

In [13]:
import os
import google.generativeai as genai
from typing import List

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the response generator for Gemini ---
def generate_response(query: str, context: str, model: str = "gemini-1.5-flash") -> str:
    """
    Generates a response based on the query and context using Gemini.

    Args:
        query (str): User query
        context (str): Retrieved context
        model (str): Model to use for response generation

    Returns:
        str: Generated response
    """
    # Define the system prompt to guide the AI's behavior
    system_prompt = "You are a helpful AI assistant. Answer the user's question based only on the provided context. If you cannot find the answer in the context, state that you don't have enough information."
    
    # Create the user prompt by combining the context and query
    user_prompt = f"""
Context:
{context}

Question: {query}

Please provide a comprehensive answer based only on the context above.
"""
    
    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        # Generate the response using the specified model
        response = gemini_model.generate_content(user_prompt)
        # Return the generated response content
        return response.text
    except Exception as e:
        print(f"An error occurred during response generation: {e}")
        return "I could not generate a response due to an error."

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a query and context from a previous step
    query = "What are the main causes of homelessness?"
    context = "Homelessness is a complex social problem. A key factor is the lack of affordable housing, which disproportionately affects low-income families and individuals."
    
    print("Generating AI response with Gemini...")
    ai_response = generate_response(query, context)
    
    print("\nAI Response:")
    print(ai_response)

Generating AI response with Gemini...

AI Response:
Based on the provided text, a key factor contributing to homelessness is the lack of affordable housing.  This disproportionately impacts low-income families and individuals.



## Full RAG Pipeline with Reranking
So far, we have implemented the core components of the RAG pipeline: document processing, reranking, and response generation. Now, we will combine these components to create a full RAG pipeline.

In [14]:

from typing import Any, Dict, List, Union

# --- 2. Helper Functions (Assumed to be defined and configured for Gemini) ---
# Compact Gemini-based versions of the helpers defined earlier in the notebook.
def create_embeddings(text: Union[str, List[str]], model: str = "models/embedding-001") -> Any:
    """
    Creates embeddings for the given text using the Gemini API.
    """
    try:
        response = genai.embed_content(model=model, content=text)
        # 'embedding' holds one vector for a string input, or a list of vectors for a list input.
        return response['embedding']
    except Exception as e:
        print(f"Embedding error: {e}")
        return []

def rerank_with_llm(query: str, results: List[Dict], top_n: int = 3, model: str = "gemini-1.5-flash") -> List[Dict]:
    """
    Reranks search results using LLM relevance scoring from Gemini.
    """
    system_prompt = "You are an expert at evaluating document relevance for search queries. Your task is to rate documents on a scale from 0 to 10. You MUST respond with ONLY a single integer score between 0 and 10. Do not include ANY other text."
    
    scored_results = []
    
    try:
        llm_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        for result in results:
            user_prompt = f"Query: {query}\n\nDocument:\n{result['text']}\n\nRate this document's relevance to the query on a scale from 0 to 10:"
            response = llm_model.generate_content(user_prompt)
            score = float(response.text.strip())
            scored_results.append({**result, "relevance_score": score})
    except Exception as e:
        print(f"Reranking error: {e}")
        return []

    return sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)[:top_n]

def rerank_with_keywords(query: str, results: List[Dict], top_n: int = 3) -> List[Dict]:
    """
    A simple keyword-based reranking method.
    """
    # This function is already API-agnostic, no changes needed.
    keywords = [word.lower() for word in query.split() if len(word) > 3]
    scored_results = []
    for result in results:
        document_text = result["text"].lower()
        keyword_score = sum([0.1 for keyword in keywords if keyword in document_text])
        final_score = result.get("similarity", 0) * 0.5 + keyword_score
        scored_results.append({**result, "relevance_score": final_score})
    return sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)[:top_n]


def generate_response(query: str, context: str, model: str = "gemini-1.5-flash") -> str:
    """
    Generates a response based on the query and context using Gemini.
    """
    system_prompt = "You are a helpful AI assistant. Answer the user's question based only on the provided context. If you cannot find the answer in the context, state that you don't have enough information."
    user_prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nPlease provide a comprehensive answer based only on the context above."
    
    try:
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        response = gemini_model.generate_content(user_prompt)
        return response.text
    except Exception as e:
        print(f"Response generation error: {e}")
        return "I could not generate a response due to an error."

# --- 3. Your main RAG pipeline function (revised) ---
def rag_with_reranking(query: str, vector_store, reranking_method: str = "llm", top_n: int = 3, model: str = "gemini-1.5-flash") -> Dict[str, Any]:
    """
    Complete RAG pipeline incorporating reranking.
    """
    # Create query embedding
    # For a single string, create_embeddings returns the vector itself (a flat list of floats)
    query_embedding = create_embeddings(query)
    if not query_embedding:
        return {"error": "Failed to create query embedding."}

    # Initial retrieval (get more than we need for reranking)
    initial_results = vector_store.similarity_search(query_embedding, k=10)
    
    # Apply reranking
    if reranking_method == "llm":
        reranked_results = rerank_with_llm(query, initial_results, top_n=top_n, model=model)
    elif reranking_method == "keywords":
        reranked_results = rerank_with_keywords(query, initial_results, top_n=top_n)
    else:
        reranked_results = initial_results[:top_n]
    
    # Combine context from reranked results
    context = "\n\n===\n\n".join([result["text"] for result in reranked_results])
    
    # Generate response based on context
    response = generate_response(query, context, model=model)
    
    return {
        "query": query,
        "reranking_method": reranking_method,
        "initial_results": initial_results[:top_n],
        "reranked_results": reranked_results,
        "context": context,
        "response": response
    }
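
Because the returned dictionary contains both `initial_results` and `reranked_results`, a quick way to see what reranking changed is to compare positions. The helper below is a hypothetical sketch (`rank_changes` is not part of the notebook) and assumes every result carries `metadata["index"]`, as set by `process_document`.

```python
from typing import Dict, List

def rank_changes(initial: List[Dict], reranked: List[Dict]) -> List[Dict]:
    # Map each chunk index to its 1-based position in the initial ranking.
    initial_rank = {r["metadata"]["index"]: pos for pos, r in enumerate(initial, start=1)}
    changes = []
    for pos, r in enumerate(reranked, start=1):
        idx = r["metadata"]["index"]
        # old_rank is None when the chunk was not in the initial list at all.
        changes.append({"index": idx, "old_rank": initial_rank.get(idx), "new_rank": pos})
    return changes

initial = [{"metadata": {"index": 4}}, {"metadata": {"index": 7}}]
reranked = [{"metadata": {"index": 7}}, {"metadata": {"index": 9}}]
for change in rank_changes(initial, reranked):
    print(change)
```

Note that `rag_with_reranking` reports only the first `top_n` initial results while the reranker draws from ten candidates, so an `old_rank` of `None` means the chunk was promoted from outside the initial top `top_n`.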

## Evaluating Reranking Quality

In [19]:
import json

# --- Load the validation data from a JSON file ---
with open('/Users/kekunkoya/Desktop/770 Google /valh.json', 'r') as f:
    data = json.load(f)

# --- Extract the first query and reference answer ---
query = data[0]['question']
reference_answer = data[0]['ideal_answer']

# --- Path to the PDF file for processing ---
pdf_path = "/Users/kekunkoya/Desktop/770 Google /Homelessness.pdf"

# --- Example: For Google Gemini usage ---
import google.generativeai as genai
import os

# Configure your Google API Key
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# You can now pass `query`, `reference_answer`, and `pdf_path`
# into your Gemini RAG or embedding pipeline


In [1]:
import time
import math
import numpy as np
from PyPDF2 import PdfReader
import google.generativeai as genai

GEMINI_MODEL = "gemini-1.5-pro"
EMBED_MODEL  = "text-embedding-004"

# ---- Tunables to prevent OOM ----
MAX_PAGES     = 150        # hard cap pages read (raise if needed)
CHUNK_SIZE    = 1000       # chars per chunk
CHUNK_OVERLAP = 120
CHUNK_LIMIT   = 1200       # hard cap number of chunks sent to embed
BATCH_SIZE    = 8          # how many chunks to embed at a time
RETRY_MAX     = 4

def extract_text_from_pdf(pdf_path: str, max_pages: int = MAX_PAGES) -> str:
    out = []
    with open(pdf_path, "rb") as f:
        reader = PdfReader(f)
        total = min(len(reader.pages), max_pages)
        for i in range(total):
            t = reader.pages[i].extract_text() or ""
            if t.strip():
                out.append(t)
    return "\n".join(out).strip()

def chunk_text(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP):
    chunks, n = [], len(text)
    i = 0
    while i < n:
        j = min(n, i + chunk_size)
        chunk = text[i:j].strip()
        if chunk:
            chunks.append(chunk)
        if j == n:
            break  # end of text reached; stepping back by `overlap` would loop forever
        # move forward with overlap, always advancing by at least one character
        i = max(j - overlap, i + 1)
    return chunks

def _embed_batch(batch_texts):
    # robust single-call batch embed with retries
    for attempt in range(1, RETRY_MAX + 1):
        try:
            # embed_content accepts a list of strings through the "content" argument
            resp = genai.embed_content(model=EMBED_MODEL, content=batch_texts)
            # For a list input, resp["embedding"] holds one vector per text
            vecs = resp["embedding"]
            # A single flat vector (string input) is wrapped so callers always get a list of vectors
            if vecs and isinstance(vecs[0], (int, float)):
                vecs = [vecs]
            if len(vecs) != len(batch_texts):
                raise ValueError("Unexpected embed response shape")
            return vecs
        except Exception as e:
            if attempt == RETRY_MAX:
                raise
            time.sleep(1.5 * attempt)

def embed_texts_gemini(texts):
    """Memory-friendly, batched embedding with padding."""
    all_vecs = []
    for start in range(0, len(texts), BATCH_SIZE):
        batch = texts[start:start + BATCH_SIZE]
        try:
            vecs = _embed_batch(batch)
        except Exception as e:
            # fallback: zero-vectors for this batch length
            vecs = None
            print(f"Embedding batch failed at {start}: {e}")
        if vecs is None:
            # guess dim later; store placeholder
            all_vecs.extend([[] for _ in batch])
        else:
            all_vecs.extend(vecs)

    # find max dim and pad
    maxd = max((len(v) for v in all_vecs), default=0)
    if maxd == 0:
        raise ValueError("All embeddings failed or returned empty.")
    padded = [(v + [0.0]*(maxd - len(v))) if len(v) < maxd else v for v in all_vecs]
    return np.asarray(padded, dtype=np.float32)

def process_document_gemini(pdf_path: str,
                            chunk_size=CHUNK_SIZE,
                            overlap=CHUNK_OVERLAP,
                            chunk_limit=CHUNK_LIMIT):
    text = extract_text_from_pdf(pdf_path, max_pages=MAX_PAGES)
    if not text:
        raise ValueError("No extractable text found in PDF (scanned image-only PDF?).")
    chunks = chunk_text(text, chunk_size=chunk_size, overlap=overlap)
    if not chunks:
        raise ValueError("Chunking produced no text.")

    # Hard cap to avoid OOM
    if len(chunks) > chunk_limit:
        print(f"⚠️ Truncating chunks {len(chunks)} → {chunk_limit} to avoid OOM.")
        chunks = chunks[:chunk_limit]

    # Embed in memory-safe batches
    embeddings = embed_texts_gemini(chunks)

    return {"chunks": chunks, "embeddings": embeddings}
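
The batching and retry pattern used by `embed_texts_gemini` and `_embed_batch` can be exercised without the API. The sketch below is a generic illustration under stated assumptions: `batched`, `call_with_retries`, and `flaky_embed` are hypothetical names, and the injectable `sleep` argument is an addition so the example runs without waiting.

```python
import time
from typing import Callable, Iterator, List, Sequence

def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]

def call_with_retries(fn: Callable[[], List], max_attempts: int = 4,
                      base_delay: float = 1.5,
                      sleep: Callable[[float], None] = time.sleep) -> List:
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * attempt)  # linear backoff: 1.5s, 3.0s, 4.5s
    raise ValueError("max_attempts must be at least 1")

attempts = {"count": 0}

def flaky_embed() -> List[List[float]]:
    # Stand-in for an API call that fails twice before succeeding.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return [[0.0, 1.0]]

print(list(batched(["a", "b", "c", "d", "e"], 2)))
print(call_with_retries(flaky_embed, sleep=lambda seconds: None))
```

Small batches bound how much work is lost when one request fails, and the growing delay gives a rate-limited service time to recover before the next attempt.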


In [2]:
# pip install google-generativeai PyPDF2
import os
import re
import json
import math
import numpy as np
from PyPDF2 import PdfReader
import google.generativeai as genai

# -----------------------------
# Config
# -----------------------------
GEMINI_MODEL = "gemini-1.5-pro"
EMBED_MODEL  = "text-embedding-004"
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# -----------------------------
# Utilities
# -----------------------------
def extract_text_from_pdf(pdf_path: str) -> str:
    text = []
    with open(pdf_path, "rb") as f:
        reader = PdfReader(f)
        for p in reader.pages:
            txt = p.extract_text() or ""
            text.append(txt)
    # Basic clean
    return re.sub(r"[ \t]+", " ", "\n".join(text)).strip()

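def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 150):
    """
    Simple character-based chunking (a rough stand-in for tokens) to avoid extra deps.
    Tune sizes for your PDFs and model context window.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    i = 0
    n = len(text)
    while i < n:
        j = min(n, i + chunk_size)
        chunks.append(text[i:j])
        if j == n:
            break  # last window reached; without this the loop never terminates
        i = j - overlap
    return [c.strip() for c in chunks if c.strip()]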

def embed_texts_gemini(texts):
    # Embed one chunk per request so a single failure does not discard the rest
    vecs = []
    for t in texts:
        try:
            resp = genai.embed_content(model=EMBED_MODEL, content=t)
            # For a single string, the SDK returns {"embedding": [float, ...]}
            vecs.append(list(resp["embedding"]))
        except Exception as e:
            print(f"Embedding failed for a chunk: {e}")
            vecs.append([])  # placeholder; zero-padded below
    # Zero-pad failed chunks so every row has the same dimension
    maxd = max((len(v) for v in vecs), default=0)
    if maxd == 0:
        raise ValueError("All embeddings failed or returned empty.")
    padded = [v + [0.0]*(maxd - len(v)) for v in vecs]
    return np.array(padded, dtype=np.float32)

def cosine_sim_matrix(a, b):
    a_norm = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b_norm = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return a_norm @ b_norm.T

def keyword_overlap_score(query, text):
    qs = set(re.findall(r"\b\w+\b", query.lower()))
    ts = set(re.findall(r"\b\w+\b", text.lower()))
    if not qs: return 0.0
    return len(qs & ts) / math.sqrt(len(qs) * (len(ts) + 1e-9))

# -----------------------------
# Core: document processing
# -----------------------------
def process_document_gemini(pdf_path: str, chunk_size=1200, overlap=150):
    full_text = extract_text_from_pdf(pdf_path)
    chunks = chunk_text(full_text, chunk_size=chunk_size, overlap=overlap)
    if not chunks:
        raise ValueError("No text extracted from PDF.")
    embeddings = embed_texts_gemini(chunks)
    return {
        "chunks": chunks,
        "embeddings": embeddings,  # np.ndarray [num_chunks, dim]
    }

# -----------------------------
# Reranking helpers
# -----------------------------
def retrieve_top_k(query, vector_store, k=6):
    qvec = embed_texts_gemini([query])  # [1, dim]
    sims = cosine_sim_matrix(qvec, vector_store["embeddings"])[0]  # [num_chunks]
    top_idx = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i]), vector_store["chunks"][i]) for i in top_idx]

def llm_rerank_gemini(query, candidates):
    """
    Ask Gemini to rank the candidate contexts by usefulness; returns same list re-ordered.
    """
    model = genai.GenerativeModel(GEMINI_MODEL)
    numbered = "\n\n".join([f"[{i}] {c[:1200]}" for i, (_, _, c) in enumerate(candidates)])
    prompt = f"""
You are ranking context passages for answering a question. 
Question: {query}

Passages:
{numbered}

Return a JSON list of passage indices in best-to-worst order, for example: [2,0,1,...]
Only return the JSON array.
"""
    try:
        resp = model.generate_content(prompt)
        txt = resp.text.strip()
        # The model may wrap the array in a Markdown fence or extra prose
        match = re.search(r"\[[^\[\]]*\]", txt)
        order = json.loads(match.group(0) if match else txt)
        # Accept only a complete permutation of the candidate indices
        if sorted(order) != list(range(len(candidates))):
            # Fallback: keep the retrieval order if the output is malformed
            return candidates
        return [candidates[i] for i in order]
    except Exception as e:
        print(f"LLM rerank failed: {e}")
        return candidates

def keyword_rerank(query, candidates):
    scored = [(keyword_overlap_score(query, c[2]), c) for c in candidates]
    scored.sort(key=lambda x: -x[0])
    return [c for _, c in scored]

# -----------------------------
# RAG pipeline
# -----------------------------
def rag_with_reranking_gemini(query, vector_store, reranking_method="none", top_k=6, max_ctx_chars=3500):
    # 1) retrieve by embeddings
    candidates = retrieve_top_k(query, vector_store, k=top_k)

    # 2) rerank if requested
    if reranking_method == "llm":
        candidates = llm_rerank_gemini(query, candidates)
    elif reranking_method == "keywords":
        candidates = keyword_rerank(query, candidates)
    # else "none" → leave as is

    # 3) build context
    context_parts = []
    total = 0
    for _, _, chunk in candidates:
        if total + len(chunk) > max_ctx_chars: break
        context_parts.append(chunk)
        total += len(chunk)
    context = "\n\n---\n\n".join(context_parts)

    # 4) ask Gemini
    system_msg = (
        "You are a careful research assistant. "
        "Answer using ONLY the provided context. If the answer is not in the context, say you cannot find it."
    )
    user_prompt = f"Question: {query}\n\nContext:\n{context}\n\nAnswer:"

    model = genai.GenerativeModel(GEMINI_MODEL)
    try:
        resp = model.generate_content([system_msg, user_prompt])
        answer = resp.text.strip()
    except Exception as e:
        answer = f"[Generation failed: {e}]"

    return {
        "response": answer,
        "used_chunks": [c[0] for c in candidates],
    }

# -----------------------------
# Example usage (drop-in)
# -----------------------------
if __name__ == "__main__":
    pdf_path = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf"

    print("Processing document with Gemini embeddings...")
    vector_store = process_document_gemini(pdf_path)

    query = "What measurement approaches are commonly used to collect data on homelessness in Europe?"
    print("Comparing retrieval methods...")

    print("\n=== STANDARD RETRIEVAL ===")
    standard_results = rag_with_reranking_gemini(query, vector_store, reranking_method="none")
    print(f"\nQuery: {query}")
    print(f"\nResponse:\n{standard_results['response']}")

    print("\n=== LLM-BASED RERANKING ===")
    llm_results = rag_with_reranking_gemini(query, vector_store, reranking_method="llm")
    print(f"\nQuery: {query}")
    print(f"\nResponse:\n{llm_results['response']}")

    print("\n=== KEYWORD-BASED RERANKING ===")
    keyword_results = rag_with_reranking_gemini(query, vector_store, reranking_method="keywords")
    print(f"\nQuery: {query}")
    print(f"\nResponse:\n{keyword_results['response']}")


Processing document with Gemini embeddings...
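
The pipeline above needs an API key and a PDF to run. The two-stage idea, a cheap vector search followed by a second scoring pass over the survivors, can be sketched offline with made-up data. Everything in this sketch is an assumption for illustration: the three chunks, the 3-dimensional "embeddings", and the helper names `cosine` and `keyword_overlap`. The keyword score here is a simpler fraction than the notebook's `keyword_overlap_score`.

```python
import math
import re

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_overlap(query, text):
    """Fraction of query words that also occur in the text."""
    q = set(re.findall(r"\w+", query.lower()))
    t = set(re.findall(r"\w+", text.lower()))
    return len(q & t) / len(q) if q else 0.0

# Hypothetical chunks with made-up 3-d vectors (real ones come from the embedding API)
chunks = [
    "Street counts are a common survey method.",
    "Housing costs rose sharply last year.",
    "Registers and surveys measure homelessness in Europe.",
]
vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.7, 0.2, 0.1]]
query = "How is homelessness in Europe measured with surveys?"
query_vec = [1.0, 0.0, 0.0]

# Stage 1: vector search keeps the top-k candidates
sims = [cosine(query_vec, v) for v in vectors]
top_k = sorted(range(len(chunks)), key=lambda i: sims[i], reverse=True)[:2]

# Stage 2: rerank only those candidates with a second signal
reranked = sorted(top_k, key=lambda i: keyword_overlap(query, chunks[i]), reverse=True)

print("vector order:  ", top_k)     # [0, 2]
print("reranked order:", reranked)  # [2, 0]
```

In this toy data the reranker promotes chunk 2, which shares four words with the query, over chunk 0, which had the highest cosine score but shares none.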



In [15]:
def evaluate_reranking(query, standard_results, reranked_results, reference_answer=None):
    """
    Evaluates the quality of reranked results compared to standard results.
    
    Args:
        query (str): User query
        standard_results (Dict): Results from standard retrieval
        reranked_results (Dict): Results from reranked retrieval
        reference_answer (str, optional): Reference answer for comparison
        
    Returns:
        str: Evaluation output
    """
    # Define the system prompt for the AI evaluator
    system_prompt = """You are an expert evaluator of RAG systems.
    Compare the retrieved contexts and responses from two different retrieval methods.
    Assess which one provides better context and a more accurate, comprehensive answer."""
    
    # Prepare the comparison text with truncated contexts and responses
    comparison_text = f"""Query: {query}

Standard Retrieval Context:
{standard_results['context'][:1000]}... [truncated]

Standard Retrieval Answer:
{standard_results['response']}

Reranked Retrieval Context:
{reranked_results['context'][:1000]}... [truncated]

Reranked Retrieval Answer:
{reranked_results['response']}"""

    # If a reference answer is provided, include it in the comparison text
    if reference_answer:
        comparison_text += f"""
        
Reference Answer:
{reference_answer}"""

    # Create the user prompt for the AI evaluator
    user_prompt = f"""
{comparison_text}

Please evaluate which retrieval method provided:
1. More relevant context
2. More accurate answer
3. More comprehensive answer
4. Better overall performance

Provide a detailed analysis with specific examples.
"""
    
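    # Generate the evaluation response using the specified model.
    # NOTE: `client` is not defined in this notebook; this cell assumes an
    # OpenAI-compatible client created elsewhere. See evaluate_reranking_gemini
    # below for the Gemini version.
    response = client.chat.completions.create(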
        model="meta-llama/Llama-3.2-3B-Instruct",
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    
    # Return the evaluation output
    return response.choices[0].message.content

In [1]:
def evaluate_reranking_gemini(query, standard_results, reranked_results, reference_answer=None):
    """
    Evaluates reranked vs standard results using Google Gemini.
    Expects results dicts to contain 'context' and 'response'.
    """
    # System + user prompts
    system_prompt = (
        "You are an expert evaluator of RAG systems. "
        "Compare the retrieved contexts and responses from two different retrieval methods. "
        "Assess which one provides better context and a more accurate, comprehensive answer."
    )

    # Guard for missing keys
    std_ctx = (standard_results.get("context") or "")[:1000]
    std_ans = standard_results.get("response") or ""
    rr_ctx  = (reranked_results.get("context") or "")[:1000]
    rr_ans  = reranked_results.get("response") or ""

    comparison_text = f"""Query: {query}

Standard Retrieval Context:
{std_ctx}... [truncated]

Standard Retrieval Answer:
{std_ans}

Reranked Retrieval Context:
{rr_ctx}... [truncated]

Reranked Retrieval Answer:
{rr_ans}"""

    if reference_answer:
        comparison_text += f"""

Reference Answer:
{reference_answer}"""

    user_prompt = f"""{comparison_text}

Please evaluate which retrieval method provided:
1. More relevant context
2. More accurate answer
3. More comprehensive answer
4. Better overall performance

Provide a detailed analysis with specific examples."""

    # Call Gemini
    model = genai.GenerativeModel(GEMINI_MODEL)
    try:
        resp = model.generate_content(
            [{"role": "user", "parts": [system_prompt + "\n\n" + user_prompt]}],
            generation_config={"temperature": 0}
        )
        return (resp.text or "").strip()
    except Exception as e:
        return f"[Evaluation failed: {e}]"


In [3]:
def rag_with_reranking_gemini(query, vector_store, reranking_method="none", top_k=6, max_ctx_chars=3500):
    # 1) retrieve by embeddings
    candidates = retrieve_top_k(query, vector_store, k=top_k)

    # 2) rerank if requested
    if reranking_method == "llm":
        candidates = llm_rerank_gemini(query, candidates)
    elif reranking_method == "keywords":
        candidates = keyword_rerank(query, candidates)

    # 3) build context
    context_parts = []
    total = 0
    for _, _, chunk in candidates:
        if total + len(chunk) > max_ctx_chars:
            break
        context_parts.append(chunk)
        total += len(chunk)
    context = "\n\n---\n\n".join(context_parts)

    # 4) ask Gemini
    system_msg = (
        "You are a careful research assistant. "
        "Answer using ONLY the provided context. If the answer is not in the context, say you cannot find it."
    )
    user_prompt = f"Question: {query}\n\nContext:\n{context}\n\nAnswer:"

    model = genai.GenerativeModel(GEMINI_MODEL)
    try:
        resp = model.generate_content([system_msg, user_prompt])
        answer = resp.text.strip()
    except Exception as e:
        answer = f"[Generation failed: {e}]"

    # Include the context in the result so the evaluator can inspect it
    return {
        "response": answer,
        "used_chunks": [c[0] for c in candidates],
        "context": context  # added so evaluate_reranking_gemini works
    }
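
Step 2 above relies on `llm_rerank_gemini`, which in turn depends on the model returning a usable list of indices. Models often wrap JSON in a Markdown fence or repeat an index, so it can help to validate the reply before reordering. The sketch below is one possible way to do that; `parse_ranking` is a hypothetical helper, not part of the Gemini SDK or of this notebook.

```python
import json
import re
from typing import List, Optional

def parse_ranking(raw: str, n_candidates: int) -> Optional[List[int]]:
    """Return a valid permutation of range(n_candidates), or None if unusable."""
    # Take the first flat [...] span, ignoring code fences or surrounding prose
    match = re.search(r"\[[^\[\]]*\]", raw)
    if not match:
        return None
    try:
        order = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly
    if not all(isinstance(i, int) and not isinstance(i, bool) for i in order):
        return None
    if sorted(order) != list(range(n_candidates)):
        return None  # missing, duplicated, or out-of-range indices
    return order

print(parse_ranking("```json\n[2, 0, 1]\n```", 3))  # [2, 0, 1]
print(parse_ranking("[0, 0, 1]", 3))                 # None
```

Returning `None` lets the caller fall back to the original retrieval order instead of silently dropping or duplicating passages.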


In [19]:
import os
import google.generativeai as genai
from typing import List, Dict, Any

# --- 0) Initialize Gemini client ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- Define the evaluation function for Gemini ---
def evaluate_reranking(query: str, standard_results: Any, reranked_results: Any, reference_answer: str, model: str = "gemini-1.5-flash") -> str:
    """
    Evaluates reranking results using Gemini.
    
    Args:
    query (str): User's question.
    standard_results (Any): Results before reranking.
    reranked_results (Any): Results after reranking.
    reference_answer (str): The ideal answer.
    model (str): Model to use for evaluation.

    Returns:
    str: A detailed analysis of the results.
    """
    system_prompt = (
        "You are an objective evaluator. Provide a detailed analysis of how the two result sets compare "
        "against the reference answer, with specific examples."
    )
    
    comparison_text = (
        f"Query: {query}\n\n"
        f"Standard Results:\n{standard_results}\n\n"
        f"Reranked Results:\n{reranked_results}\n\n"
        f"Reference Answer:\n{reference_answer}"
    )
    
    user_prompt = (
        f"{comparison_text}\n\n"
        "Please compare faithfulness, relevance, and ordering—point out specific strengths and weaknesses."
    )
    
    try:
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        resp = gemini_model.generate_content(user_prompt)
        return resp.text
    except Exception as e:
        print(f"An error occurred during evaluation: {e}")
        return "Evaluation failed due to an error."

# --- Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate data for a runnable example
    query = "What are the main causes of homelessness?"
    standard_results = ["lack of affordable housing", "job loss", "mental health issues"]
    llm_results = ["job loss", "lack of affordable housing", "mental health issues"]
    reference_answer = "Homelessness is a complex issue driven by a lack of affordable housing, economic instability, and personal crises like job loss."

    # 1) Run the evaluation
    evaluation = evaluate_reranking(
        query=query,
        standard_results=standard_results,
        reranked_results=llm_results,
        reference_answer=reference_answer
    )
    
    # 2) Print
    print("\n=== EVALUATION RESULTS ===")
    print(evaluation)


=== EVALUATION RESULTS ===
## Analysis of Result Sets Compared to Reference Answer

Both the standard and reranked result sets perform similarly well against the reference answer in terms of faithfulness and relevance, but differ slightly in ordering.  Let's break down each aspect:

**Faithfulness:**

* **Standard Results:**  Faithfully reflects the key points in the reference answer.  "Lack of affordable housing," "job loss," and "mental health issues" are all directly mentioned or implied as significant contributing factors in the reference.
* **Reranked Results:** Equally faithful. The same three causes are present, just in a different order.

**Relevance:**

* **Standard Results:** All three items are highly relevant to the causes of homelessness.  They represent significant contributing factors supported by extensive research.
* **Reranked Results:**  Same as above. The relevance of each item remains unchanged by the reordering.

**Ordering:**

* **Standard Results:** The order (