# Query Transformations for Enhanced RAG Systems

This notebook implements three query transformation techniques that can improve retrieval in RAG systems, without relying on frameworks such as LangChain. Rewriting, broadening, or splitting the user's query before it is embedded changes which chunks are retrieved, and so can improve the relevance and coverage of the retrieved context.

## Key Transformation Techniques

1. **Query Rewriting**: Makes queries more specific and detailed for better search precision.
2. **Step-back Prompting**: Generates broader queries to retrieve useful contextual information.
3. **Sub-query Decomposition**: Breaks complex queries into simpler components for comprehensive retrieval.

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
import fitz  # PyMuPDF (pip install PyMuPDF)
import os
import numpy as np
import json
from openai import OpenAI

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [2]:
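# Initialize the OpenAI client against Gemini's OpenAI-compatible endpoint.
# The base URL is required: without it the client targets api.openai.com,
# which accepts neither a Google API key nor the "gemini-2.0-flash" model name.
client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key=os.getenv("GOOGLE_API_KEY")  # Retrieve the API key from environment variables
)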
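
If the environment variable is missing, `os.getenv` returns `None` and the problem only shows up at the first API call. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration, not part of the OpenAI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, the key could be passed as `api_key=require_env("GOOGLE_API_KEY")` when constructing the client.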

## Implementing Query Transformation Techniques
### 1. Query Rewriting
This technique makes queries more specific and detailed to improve precision in retrieval.

In [3]:
def rewrite_query(original_query, model="gemini-2.0-flash"):
    """
    Rewrites a query to make it more specific and detailed for better retrieval.
    
    Args:
        original_query (str): The original user query
        model (str): The model to use for query rewriting
        
    Returns:
        str: The rewritten query
    """
    # Define the system prompt to guide the AI assistant's behavior
    system_prompt = "You are an AI assistant specialized in improving search queries. Your task is to rewrite user queries to be more specific, detailed, and likely to retrieve relevant information."
    
    # Define the user prompt with the original query to be rewritten
    user_prompt = f"""
    Rewrite the following query to make it more specific and detailed. Include relevant terms and concepts that might help in retrieving accurate information.
    
    Original query: {original_query}
    
    Rewritten query:
    """
    
    # Generate the rewritten query using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0.1,  # Low temperature for deterministic output
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    
    # Return the rewritten query, stripping any leading/trailing whitespace
    return response.choices[0].message.content.strip()
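
A rewritten query is only useful if the model actually returns one. The sketch below shows one possible way to fall back to the original query when a transformation fails or comes back empty. `transform_with_fallback` is a hypothetical wrapper, and the choice to swallow all exceptions is an assumption made for brevity.

```python
from typing import Callable


def transform_with_fallback(transform: Callable[[str], str], query: str) -> str:
    """Apply `transform` to `query`; return the original query if it fails or yields nothing."""
    try:
        result = transform(query)
    except Exception as exc:  # e.g. network or API errors
        print(f"Query transformation failed, using original query: {exc}")
        return query
    # Treat None, empty, or whitespace-only output as "no rewrite available"
    result = (result or "").strip()
    return result if result else query
```

For example, `transform_with_fallback(rewrite_query, user_query)` would keep retrieval working even when the rewriting call is unavailable.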

### 2. Step-back Prompting
This technique generates broader queries to retrieve contextual background information.

In [4]:
def generate_step_back_query(original_query, model="gemini-2.0-flash"):
    """
    Generates a more general 'step-back' query to retrieve broader context.
    
    Args:
        original_query (str): The original user query
        model (str): The model to use for step-back query generation
        
    Returns:
        str: The step-back query
    """
    # Define the system prompt to guide the AI assistant's behavior
    system_prompt = "You are an AI assistant specialized in search strategies. Your task is to generate broader, more general versions of specific queries to retrieve relevant background information."
    
    # Define the user prompt with the original query to be generalized
    user_prompt = f"""
    Generate a broader, more general version of the following query that could help retrieve useful background information.
    
    Original query: {original_query}
    
    Step-back query:
    """
    
    # Generate the step-back query using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0.1,  # Low temperature for near-deterministic output
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    
    # Return the step-back query, stripping any leading/trailing whitespace
    return response.choices[0].message.content.strip()
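
A step-back query retrieves background material, which is usually most helpful alongside the chunks retrieved for the original question rather than instead of them. The sketch below assumes both retrievals have already been run and shows one way to keep the two kinds of context separate in the final prompt. `build_step_back_prompt` and its section labels are illustrative assumptions, not part of this notebook's pipeline.

```python
from typing import List


def format_bullets(chunks: List[str]) -> str:
    """Render chunks as a bulleted list, with a placeholder when there are none."""
    return "\n".join(f"- {chunk}" for chunk in chunks) if chunks else "- (none)"


def build_step_back_prompt(question: str,
                           specific_chunks: List[str],
                           background_chunks: List[str]) -> str:
    """Assemble a prompt with background context first, then question-specific context."""
    sections = [
        "Background context:",
        format_bullets(background_chunks),
        "",
        "Specific context:",
        format_bullets(specific_chunks),
        "",
        f"Question: {question}",
    ]
    return "\n".join(sections)
```

Placing the general material first lets the model read the specific chunks against that background; the ordering is a design choice, not a requirement.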

### 3. Sub-query Decomposition
This technique breaks down complex queries into simpler components for comprehensive retrieval.

In [5]:
def decompose_query(original_query, num_subqueries=4, model="gemini-2.0-flash"):
    """
    Decomposes a complex query into simpler sub-queries.
    
    Args:
        original_query (str): The original complex query
        num_subqueries (int): Number of sub-queries to generate
        model (str): The model to use for query decomposition
        
    Returns:
        List[str]: A list of simpler sub-queries
    """
    # Define the system prompt to guide the AI assistant's behavior
    system_prompt = "You are an AI assistant specialized in breaking down complex questions. Your task is to decompose complex queries into simpler sub-questions that, when answered together, address the original query."
    
    # Define the user prompt with the original query to be decomposed
    user_prompt = f"""
    Break down the following complex query into {num_subqueries} simpler sub-queries. Each sub-query should focus on a different aspect of the original question.
    
    Original query: {original_query}
    
    Generate {num_subqueries} sub-queries, one per line, in this format:
    1. [First sub-query]
    2. [Second sub-query]
    And so on...
    """
    
    # Generate the sub-queries using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0.4,  # Slightly higher temperature for some variation
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )
    
    # Process the response to extract sub-queries
    content = response.choices[0].message.content.strip()
    
    # Extract numbered queries using simple parsing
    lines = content.split("\n")
    sub_queries = []
    
    for line in lines:
        # Accept any "N." prefix (not just 1-9) and keep the text after the first period
        number, separator, rest = line.strip().partition(".")
        if separator and number.isdigit() and rest.strip():
            sub_queries.append(rest.strip())
    
    return sub_queries
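
Each sub-query is typically searched on its own, so the same chunk can come back several times with different scores. A minimal sketch of merging those result lists follows, assuming each hit is a dict with `"text"` and `"similarity"` keys as returned by the `SimpleVectorStore` used later in this notebook. The function name `merge_sub_query_results` is hypothetical.

```python
from typing import Dict, List


def merge_sub_query_results(result_lists: List[List[Dict]], top_k: int = 3) -> List[Dict]:
    """Deduplicate hits by text, keep the best score for each, and return the top_k."""
    best_by_text: Dict[str, Dict] = {}
    for results in result_lists:
        for result in results:
            current = best_by_text.get(result["text"])
            if current is None or result["similarity"] > current["similarity"]:
                best_by_text[result["text"]] = result
    # Highest similarity first
    ranked = sorted(best_by_text.values(), key=lambda r: r["similarity"], reverse=True)
    return ranked[:top_k]
```

Keeping the maximum score favors chunks that match any one sub-query strongly; summing or averaging scores would instead favor chunks that match several sub-queries moderately.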

## Demonstrating Query Transformation Techniques
Let's apply these techniques to an example query.

In [6]:
import google.generativeai as genai
import os

# Assumes genai.configure(api_key=os.getenv("GEMINI_API_KEY")) has already been run.
# Note: the functions defined in this cell replace the OpenAI-client versions above.

def call_gemini(prompt, model="gemini-2.0-flash", temperature=0):
    """A helper function to make a call to the Gemini API."""
    model_instance = genai.GenerativeModel(model)
    response = model_instance.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(temperature=temperature)
    )
    return response.text.strip()

# --- Example query ---
original_query = "Where can I find emergency food in ZIP code 17104?"

# --- Function Definitions for Query Transformations ---

def rewrite_query(query):
    """
    Rewrites the user query to be more effective for keyword-based retrieval.
    """
    prompt = f"""
    You are a query rewriting model. Your task is to rephrase the user's query
    to be more concise and suitable for a search engine. Do not add or remove
    any key information.

    User Query: {query}

    Rewritten Query:
    """
    # This function would call your LLM
    # return call_gemini(prompt)
    
    # Placeholder for demonstration
    return "emergency food resources in 17104"

def generate_step_back_query(query):
    """
    Generates a step-back query to get more general context.
    """
    prompt = f"""
    You are a helpful assistant. The user is asking a specific question.
    What is the high-level, "step-back" question that would provide
    the necessary context to answer the user's original question?

    Original Query: {query}

    Step-back Question:
    """
    # This function would call your LLM
    # return call_gemini(prompt)
    
    # Placeholder for demonstration
    return "What are the common methods for finding food assistance?"

def decompose_query(query, num_subqueries=3):
    """
    Decomposes a complex query into a set of simpler sub-queries.
    """
    prompt = f"""
    Decompose the following query into {num_subqueries} distinct and independent sub-queries.
    List each sub-query on a new line, starting with a number followed by a period.

    Original Query: {query}
    
    Decomposed Sub-queries:
    """
    # This function would call your LLM
    # raw_response = call_gemini(prompt)
    # return raw_response.split('\n')
    
    # Placeholder for demonstration
    return [
        "1. What are local food banks in ZIP code 17104?",
        "2. Are there any food pantries in the Harrisburg, PA area?",
        "3. How can I apply for food assistance programs?"
    ]

# --- Apply query transformations ---
print("Original Query:", original_query)

# Query Rewriting
rewritten_query = rewrite_query(original_query)
print("\n1. Rewritten Query:")
print(rewritten_query)

# Step-back Prompting
step_back_query = generate_step_back_query(original_query)
print("\n2. Step-back Query:")
print(step_back_query)

# Sub-query Decomposition
sub_queries = decompose_query(original_query, num_subqueries=3)
print("\n3. Sub-queries:")
for query in sub_queries:
    print(f"   {query}")  # Sub-queries are already numbered, so print them as-is

Original Query: Where can I find emergency food in ZIP code 17104?

1. Rewritten Query:
emergency food resources in 17104

2. Step-back Query:
What are the common methods for finding food assistance?

3. Sub-queries:
   1. What are local food banks in ZIP code 17104?
   2. Are there any food pantries in the Harrisburg, PA area?
   3. How can I apply for food assistance programs?
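
When all three transformations are applied to the same query, a small dispatch table avoids repeating the call-and-print pattern. The sketch below uses stand-in lambdas instead of LLM calls so that it runs offline; `apply_transformations` and `demo_transforms` are hypothetical names, and the stand-in outputs are made up for illustration.

```python
from typing import Callable, Dict, List, Union

Transformed = Union[str, List[str]]


def apply_transformations(query: str,
                          transforms: Dict[str, Callable[[str], Transformed]]) -> Dict[str, Transformed]:
    """Run every named transformation on the query and collect the outputs by name."""
    return {name: transform(query) for name, transform in transforms.items()}


# Stand-in transformations (no API calls); swap in the real functions as needed
demo_transforms = {
    "rewrite": lambda q: q.lower().rstrip("?"),
    "step_back": lambda q: "What food assistance options exist?",
    "decompose": lambda q: [q, "Which agencies serve this area?"],
}

for name, output in apply_transformations("Where is food?", demo_transforms).items():
    print(f"{name}: {output}")
```

In the notebook, the dictionary values would be `rewrite_query`, `generate_step_back_query`, and `decompose_query`.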


## Building a Simple Vector Store
To demonstrate how query transformations integrate with retrieval, let's implement a simple vector store.

In [7]:
import numpy as np
import google.generativeai as genai
import os
from dotenv import load_dotenv

# --- Initialize Gemini client ---
# Load environment variables from a .env file
load_dotenv()

# Get the API key from the environment variable
api_key = os.getenv("GEMINI_API_KEY")

# Fail early with a clear error instead of calling exit(), which would stop the notebook kernel
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; add it to your environment or .env file.")

# Configure the Gemini API client
genai.configure(api_key=api_key)

class SimpleVectorStore:
    """
    A simple vector store implementation using NumPy.
    """
    def __init__(self):
        """
        Initialize the vector store.
        """
        self.vectors = []  # List to store embedding vectors
        self.texts = []  # List to store original texts
        self.metadata = []  # List to store metadata for each text
    
    def add_item(self, text, embedding, metadata=None):
        """
        Add an item to the vector store.

        Args:
        text (str): The original text.
        embedding (List[float]): The embedding vector.
        metadata (dict, optional): Additional metadata.
        """
        self.vectors.append(np.array(embedding))  # Convert embedding to numpy array and add to vectors list
        self.texts.append(text)  # Add the original text to texts list
        self.metadata.append(metadata or {})  # Add metadata to metadata list, use empty dict if None
    
    def similarity_search(self, query_embedding, k=5):
        """
        Find the most similar items to a query embedding.

        Args:
        query_embedding (List[float]): Query embedding vector.
        k (int): Number of results to return.

        Returns:
        List[Dict]: Top k most similar items with their texts and metadata.
        """
        if not self.vectors:
            return []  # Return empty list if no vectors are stored
        
        # Convert query embedding to numpy array
        query_vector = np.array(query_embedding)
        
        # Calculate similarities using cosine similarity
        similarities = []
        for i, vector in enumerate(self.vectors):
            # Compute cosine similarity between query vector and stored vector
            # Handle the case where a vector might have a norm of 0 to prevent division by zero
            norm_query = np.linalg.norm(query_vector)
            norm_vector = np.linalg.norm(vector)
            if norm_query == 0 or norm_vector == 0:
                similarity = 0.0
            else:
                similarity = np.dot(query_vector, vector) / (norm_query * norm_vector)
            similarities.append((i, similarity))  # Append index and similarity score
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top k results
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],  # Add the corresponding text
                "metadata": self.metadata[idx],  # Add the corresponding metadata
                "similarity": score  # Add the similarity score
            })
        
        return results  # Return the list of top k similar items

# --- Example Usage with Gemini ---

def create_gemini_embedding(text, model="models/embedding-001"):
    """Creates an embedding for a single text using Gemini."""
    response = genai.embed_content(model=model, content=text)
    return response['embedding']

if __name__ == '__main__':
    # 1. Initialize the vector store
    store = SimpleVectorStore()

    # 2. Add some items with their Gemini embeddings
    text1 = "How can I find emergency food?"
    emb1 = create_gemini_embedding(text1)
    store.add_item(text1, emb1, metadata={"source": "PA211_dataset"})

    text2 = "Where are the closest food banks in Harrisburg?"
    emb2 = create_gemini_embedding(text2)
    store.add_item(text2, emb2, metadata={"source": "PA211_dataset"})

    text3 = "What should I do during a flood?"
    emb3 = create_gemini_embedding(text3)
    store.add_item(text3, emb3, metadata={"source": "PEMA.pdf"})

    print("Vector store initialized with 3 items.")

    # 3. Perform a similarity search with a new query
    query_text = "find food assistance"
    query_embedding = create_gemini_embedding(query_text)

    print(f"\nSearching for items similar to: '{query_text}'")
    search_results = store.similarity_search(query_embedding, k=2)

    # 4. Print the search results
    print("\nTop 2 search results:")
    for result in search_results:
        print(f"  - Text: {result['text']}")
        print(f"    Similarity: {result['similarity']:.4f}")
        print(f"    Metadata: {result['metadata']}")

Vector store initialized with 3 items.

Searching for items similar to: 'find food assistance'

Top 2 search results:
  - Text: How can I find emergency food?
    Similarity: 0.7839
    Metadata: {'source': 'PA211_dataset'}
  - Text: Where are the closest food banks in Harrisburg?
    Similarity: 0.6575
    Metadata: {'source': 'PA211_dataset'}
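
The store above scores one vector at a time in a Python loop, which is easy to follow but slow for large collections. A sketch of the same cosine-similarity ranking done with a single matrix product is shown below. `cosine_top_k` is a hypothetical standalone function, not a method of `SimpleVectorStore`, and it assumes all vectors share the same dimension.

```python
import numpy as np


def cosine_top_k(vectors, query, k=5):
    """Return (index, similarity) pairs for the k stored vectors most similar to the query."""
    matrix = np.asarray(vectors, dtype=float)  # shape (n_items, dim)
    q = np.asarray(query, dtype=float)         # shape (dim,)
    dots = matrix @ q                          # dot product with every row at once
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(q)
    # Zero-norm vectors get similarity 0.0 instead of triggering a division by zero
    sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    order = np.argsort(-sims)[:k]              # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]
```

The ranking matches the loop-based version; only the amount of Python-level iteration changes.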


## Creating Embeddings

In [9]:
import google.generativeai as genai

# Make sure to initialize your Gemini API key before calling this function
# Example: genai.configure(api_key="YOUR_API_KEY")

def create_embeddings(text, model="models/embedding-001"):
    """
    Creates embeddings for the given text using the specified Gemini model.

    Args:
    text (str or list[str]): The input text(s) for which embeddings are to be created.
                                Can be a single string or a list of strings.
    model (str): The model to be used for creating embeddings. Defaults to "models/embedding-001".

    Returns:
    list[float] or list[list[float]]: The embedding vector(s).
    """
    # embed_content accepts either a single string or a list of strings.
    # For a single string, response['embedding'] is one vector (a list of floats);
    # for a list of strings, it is a list of vectors in the same order as the inputs.
    response = genai.embed_content(
        model=model,
        content=text
    )

    # The 'embedding' key holds the result in both cases, so no branching is needed.
    return response['embedding']
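
For long documents it can be useful to control how many texts go into each embedding request, for example to stay within rate limits or to report progress. The sketch below is one way to do that around any embedding function. `embed_in_batches` is a hypothetical helper, and the default batch size of 50 is an arbitrary assumption, not a documented API limit.

```python
from typing import Callable, List, Sequence


def embed_in_batches(texts: Sequence[str],
                     embed_fn: Callable[[List[str]], List[List[float]]],
                     batch_size: int = 50) -> List[List[float]]:
    """Embed texts in fixed-size batches and return the vectors in input order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        batch_embeddings = embed_fn(batch)
        # Guard against an embedding function that drops or duplicates items
        if len(batch_embeddings) != len(batch):
            raise RuntimeError("embedding function returned a different number of vectors")
        embeddings.extend(batch_embeddings)
    return embeddings
```

In this notebook the call might look like `embed_in_batches(chunks, create_embeddings)`, since `create_embeddings` accepts a list of strings.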

The `SimpleVectorStore` class defined earlier uses NumPy for in-memory storage of texts, embeddings, and metadata. Its `similarity_search` method computes the cosine similarity between the query vector and every stored vector, then returns the top `k` matches.


## Implementing RAG with Query Transformations

In [10]:
import google.generativeai as genai
import os

# --- Initialize Gemini client ---
# Ensure your API key is configured.
# genai.configure(api_key="YOUR_API_KEY")

def extract_text_with_gemini(pdf_path):
    """
    Extracts text from a PDF file using the Gemini model's document understanding capabilities.
    This function sends the entire PDF content to the model with a prompt to extract all text.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF, or an error message.
    """
    try:
        # Check if the file exists
        if not os.path.exists(pdf_path):
            return f"Error: File not found at {pdf_path}"

        # Read the PDF file as bytes
        with open(pdf_path, "rb") as pdf_file:
            pdf_data = pdf_file.read()

        # Create a GenerativeModel instance
        model = genai.GenerativeModel("gemini-2.0-flash")  # Any multimodal model that accepts PDF input

        # Create a prompt that instructs the model to extract all text
        prompt = "Extract all text from this document."

        # Prepare the parts for the generate_content call
        contents = [
            prompt,
            {
                'mime_type': 'application/pdf',
                'data': pdf_data
            }
        ]

        # Call the Gemini API to get the text
        response = model.generate_content(contents)

        # Return the extracted text
        return response.text.strip()

    except Exception as e:
        return f"An error occurred: {e}"

# Example Usage:
# pdf_file_path = "your_document.pdf"
# extracted_text = extract_text_with_gemini(pdf_file_path)
# print(extracted_text)

In [12]:
import fitz  # pip install PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the entire PDF.
    """
    # Open the PDF file
    doc = fitz.open(pdf_path)
    all_text = []

    # Iterate through each page in the PDF
    for page in doc:
        all_text.append(page.get_text("text"))

    doc.close()
    return "\n".join(all_text)

# Example usage:
pdf_file = "/Users/kekunkoya/Desktop/RAG Project/Resources.pdf"
text = extract_text_from_pdf(pdf_file)
print(text) 


Agency Name *
Site Name *
Service Name *
Site Main Phone 
Number
Service Eligibility
Lancaster County 
Housing and 
Redevelopment 
Authorities
Lancaster County Housing 
and Redevelopment 
Authorities
Home Repair Program
717-394-0793
Based on annual gross income, 
according to family size; Available 
equity in home
Housing and Repairs
Lancaster County 
Housing and 
Redevelopment 
Authorities
Lancaster County Housing 
and Redevelopment 
Authorities
Public Infrastructure and 
Community Facilities Grant 
Administration
717-394-0793
Local municipalities outside the 
city of Lancaster

Lancaster County 
Housing and 
Redevelopment 
Authorities
Lancaster County Housing 
and Redevelopment 
Authorities
Rental Housing Program
717-394-0793
Open to rental housing 
developers only; Properties must 
be located in Lancaster County, 
outside of Lancaster City
Lancaster County 
Housing and 
Redevelopment 
Authorities
Lancaster County Housing 
and Redevelopment 
Authorities
Grant Administration
717-394-0
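
The raw output shows table cells broken across several lines, which is how PyMuPDF returns wrapped text. If that matters for chunking, one option is to collapse the line breaks inside each paragraph before embedding. The sketch below assumes blank lines mark paragraph boundaries; `normalize_extracted_text` is a hypothetical helper, and it does not attempt to rebuild the table's columns.

```python
import re


def normalize_extracted_text(text: str) -> str:
    """Join wrapped lines within each paragraph and collapse runs of whitespace."""
    paragraphs = re.split(r"\n\s*\n", text)            # blank lines separate paragraphs
    cleaned = [" ".join(p.split()) for p in paragraphs]  # split() drops newlines and extra spaces
    return "\n\n".join(p for p in cleaned if p)
```

Whether this helps depends on the document: for tabular PDFs like this one it merges adjacent cells into a single line, so it trades layout information for cleaner sentences.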

In [1]:
import os
import fitz  # pip install PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the entire PDF.
    """
    doc = fitz.open(pdf_path)
    all_text = []
    for page in doc:
        all_text.append(page.get_text("text"))
    doc.close()
    return "\n".join(all_text)

def extract_texts_from_folder(folder_path: str):
    """
    Extracts text from all PDF files in a folder (recursively).
    Args:
        folder_path (str): Path to the folder containing PDFs.
    Returns:
        dict: {pdf_filename: extracted_text, ...}
    """
    pdf_texts = {}
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.pdf'):
                pdf_path = os.path.join(root, file)
                try:
                    pdf_texts[pdf_path] = extract_text_from_pdf(pdf_path)
                except Exception as e:
                    print(f"Failed to extract {pdf_path}: {e}")
    return pdf_texts

# Example usage:
folder_path = "/Users/kekunkoya/Desktop/RAG Google 2/PDFs"
pdf_texts = extract_texts_from_folder(folder_path)

for pdf, text in pdf_texts.items():
    print(f"\n--- {os.path.basename(pdf)} ---")
    print(text[:500])  # Print the first 500 characters for preview



--- PA 211 Disaster Community Resources.pdf ---
PA 211 Community Disaster and Human 
Services Resources in Pennsylvania 
Introduction 
 
Community Disaster and Human Services Resources in Pennsylvania 
 
Disasters, whether natural or man-made, have significant and far-reaching impacts on 
individuals, families, and communities. Pennsylvania, with its mix of urban, suburban, and 
rural regions, faces a diverse array of emergencies ranging from floods and severe storms to 
public health crises and housing instability. To ensure an effective res

--- 211 RESPONDS TO URGENT NEEDS.pdf ---
211 RESPONDS TO URGENT NEEDS 
FACT
211 stood up a statewide text
response to support employees
impacted by the partial federal
government shutdown who did
not know when they would
receive their next paycheck.
211 assists in times of
disaster and widespread
need
FACT
FACT
1
PLEASE VOTE TO INCLUDE FUNDING FOR PENNSYLVANIA'S 211 SYSTEM IN THE STATE BUDGET TO
SUPPORT 211'S CAPACITY TO HELP OUR COMMUNITIES IN 

In [2]:
def chunk_text(text, n=1000, overlap=200):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
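    # A step of zero or less would fail or silently return no chunks
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("n must be positive and overlap must satisfy 0 <= overlap < n")

    chunks = []  # Initialize an empty list to store the chunks

    # Advance by (n - overlap) characters so consecutive chunks share `overlap` characters
    for i in range(0, len(text), n - overlap):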
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
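
To see where the chunk boundaries fall, it helps to compute the offsets without the text itself. The sketch below reproduces the slicing arithmetic of `chunk_text`. `chunk_spans` is a hypothetical helper for illustration, and unlike the loop above it clips the end offset to the text length so the spans are easier to read.

```python
from typing import List, Tuple


def chunk_spans(length: int, n: int = 1000, overlap: int = 200) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets that a stride of n - overlap produces."""
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("require n > 0 and 0 <= overlap < n")
    step = n - overlap
    return [(start, min(start + n, length)) for start in range(0, length, step)]


print(chunk_spans(10, n=4, overlap=2))
# [(0, 4), (2, 6), (4, 8), (6, 10), (8, 10)]
print(chunk_spans(2500))
# [(0, 1000), (800, 1800), (1600, 2500), (2400, 2500)]
```

In both examples the final span lies entirely inside the previous one, so the last chunk adds no new text. That is harmless for retrieval but does cost one extra embedding per document.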

In [3]:
def process_document(pdf_path, chunk_size=1000, chunk_overlap=200):
    """
    Process a document for RAG using Gemini-compatible helper functions.

    Args:
    pdf_path (str): Path to the PDF file.
    chunk_size (int): Size of each chunk in characters.
    chunk_overlap (int): Overlap between chunks in characters.

    Returns:
    SimpleVectorStore: A vector store containing document chunks and their embeddings.
    """
    print("Extracting text from PDF...")
    # extract_text_from_pdf is the local PyMuPDF (fitz) based function defined above;
    # extracting text locally avoids an API call for every document.
    extracted_text = extract_text_from_pdf(pdf_path)

    if not extracted_text:
        print("Failed to extract text from the PDF. Exiting.")
        return None

    print("Chunking text...")
    # Split the text into overlapping, character-based chunks.
    chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
    print(f"Created {len(chunks)} text chunks")

    print("Creating embeddings for chunks...")
    # create_embeddings calls Gemini's embed_content API; passing the whole list
    # returns one embedding per chunk, in the same order.
    chunk_embeddings = create_embeddings(chunks)

    # Create and populate the vector store.
    store = SimpleVectorStore()

    if len(chunks) != len(chunk_embeddings):
        print("Error: Mismatch between number of chunks and embeddings.")
        return None

    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)):
        store.add_item(
            text=chunk,
            embedding=embedding,
            metadata={"index": i, "source": pdf_path}
        )

    print(f"Added {len(chunks)} chunks to the vector store")
    return store
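
`process_document` handles one PDF at a time. When several documents have been extracted, as with `extract_texts_from_folder`, the chunks can be prepared for a single shared store while keeping track of where each one came from. The sketch below covers only the chunking and metadata step, under the assumption that the input is a `{source: text}` dictionary; `build_chunk_records` is a hypothetical helper and it leaves embedding to the caller.

```python
from typing import Dict, List


def build_chunk_records(doc_texts: Dict[str, str],
                        chunk_size: int = 1000,
                        chunk_overlap: int = 200) -> List[Dict]:
    """Chunk every document and tag each chunk with its source and per-document index."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    records: List[Dict] = []
    for source, text in doc_texts.items():
        for index, start in enumerate(range(0, len(text), step)):
            records.append({
                "text": text[start:start + chunk_size],
                "metadata": {"index": index, "source": source},
            })
    return records
```

Each record's `"text"` can then be embedded and passed to `store.add_item` together with its `"metadata"`, so search results show which PDF a chunk came from.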

## RAG with Query Transformations

In [4]:
import numpy as np
import google.generativeai as genai
import os
from dotenv import load_dotenv

# --- Helper functions and classes (redefined here so this cell runs on its own) ---

# A simple in-memory vector store (independent of the embedding provider)
class SimpleVectorStore:
    def __init__(self):
        self.vectors = []
        self.texts = []
        self.metadata = []
    
    def add_item(self, text, embedding, metadata=None):
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})
    
    def similarity_search(self, query_embedding, k=5):
        if not self.vectors:
            return []
        
        query_vector = np.array(query_embedding)
        similarities = []
        for i, vector in enumerate(self.vectors):
            norm_query = np.linalg.norm(query_vector)
            norm_vector = np.linalg.norm(vector)
            if norm_query == 0 or norm_vector == 0:
                similarity = 0.0
            else:
                similarity = np.dot(query_vector, vector) / (norm_query * norm_vector)
            similarities.append((i, similarity))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],
                "metadata": self.metadata[idx],
                "similarity": score
            })
        
        return results

# Gemini API helper function to create embeddings
def create_embeddings(text, model="models/embedding-001"):
    """
    Creates embeddings for the given text using the specified Gemini model.

    Returns one vector for a single string, or a list of vectors for a list of strings.
    """
    # embed_content handles both input types and returns the result under 'embedding'
    response = genai.embed_content(model=model, content=text)
    return response['embedding']

# A helper function to call the Gemini API
def call_gemini(prompt, model="gemini-2.0-flash", temperature=0):
    """A helper function to make a call to the Gemini API."""
    model_instance = genai.GenerativeModel(model)
    response = model_instance.generate_content(
        prompt,
        generation_config=genai.GenerationConfig(temperature=temperature)
    )
    return response.text.strip()

# Query Rewriting with Gemini
def rewrite_query(query):
    """
    Rewrites the user query to be more effective for keyword-based retrieval.
    """
    prompt = f"""
    You are a query rewriting model. Your task is to rephrase the user's query
    to be more concise and suitable for a search engine. Do not add or remove
    any key information.

    User Query: {query}

    Rewritten Query:
    """
    return call_gemini(prompt)

# Step-back Prompting with Gemini
def generate_step_back_query(query):
    """
    Generates a step-back query to get more general context.
    """
    prompt = f"""
    You are a helpful assistant. The user is asking a specific question.
    What is the high-level, "step-back" question that would provide
    the necessary context to answer the user's original question?

    Original Query: {query}

    Step-back Question:
    """
    return call_gemini(prompt)

# Sub-query Decomposition with Gemini
def decompose_query(query, num_subqueries=3):
    """
    Decomposes a complex query into a set of simpler sub-queries.
    """
    prompt = f"""
    Decompose the following query into {num_subqueries} distinct and independent sub-queries.
    List each sub-query on a new line, starting with a number followed by a period.

    Original Query: {query}
    
    Decomposed Sub-queries:
    """
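    raw_response = call_gemini(prompt)
    # Keep only numbered lines and strip the "N." prefix, so callers can add their own numbering
    sub_queries = []
    for line in raw_response.split('\n'):
        number, separator, rest = line.strip().partition(".")
        if separator and number.isdigit() and rest.strip():
            sub_queries.append(rest.strip())
    return sub_queries[:num_subqueries]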

# --- Core function: search the vector store with a transformed query ---
def transformed_search(query, vector_store, transformation_type, top_k=3):
    """
    Search using a transformed query.
    
    Args:
        query (str): Original query
        vector_store (SimpleVectorStore): Vector store to search
        transformation_type (str): Type of transformation ('rewrite', 'step_back', or 'decompose')
        top_k (int): Number of results to return
        
    Returns:
        List[Dict]: Search results
    """
    print(f"Transformation type: {transformation_type}")
    print(f"Original query: {query}")
    
    results = []
    
    if transformation_type == "rewrite":
        # Query rewriting
        transformed_query = rewrite_query(query)
        print(f"Rewritten query: {transformed_query}")
        
        # Create embedding for transformed query
        query_embedding = create_embeddings(transformed_query)
        
        # Search with rewritten query
        results = vector_store.similarity_search(query_embedding, k=top_k)
        
    elif transformation_type == "step_back":
        # Step-back prompting
        transformed_query = generate_step_back_query(query)
        print(f"Step-back query: {transformed_query}")
        
        # Create embedding for transformed query
        query_embedding = create_embeddings(transformed_query)
        
        # Search with step-back query
        results = vector_store.similarity_search(query_embedding, k=top_k)
        
    elif transformation_type == "decompose":
        # Sub-query decomposition
        sub_queries = decompose_query(query)
        print("Decomposed into sub-queries:")
        for i, sub_q in enumerate(sub_queries, 1):
            print(f"{i}. {sub_q}")
        
        # Create embeddings for all sub-queries
        sub_query_embeddings = create_embeddings(sub_queries)
        
        # Search with each sub-query and combine results
        all_results = []
        for embedding in sub_query_embeddings:
            # Get fewer results per sub-query, since several sub-queries are merged
            sub_results = vector_store.similarity_search(embedding, k=2)
            all_results.extend(sub_results)
        
        # Remove duplicates (keep highest similarity score)
        seen_texts = {}
        for result in all_results:
            text = result["text"]
            if text not in seen_texts or result["similarity"] > seen_texts[text]["similarity"]:
                seen_texts[text] = result
        
        # Sort by similarity and take top_k
        results = sorted(seen_texts.values(), key=lambda x: x["similarity"], reverse=True)[:top_k]
        
    else:
        # Regular search without transformation
        query_embedding = create_embeddings(query)
        results = vector_store.similarity_search(query_embedding, k=top_k)
    
    return results

# --- Example Usage ---
if __name__ == '__main__':
    # You would need to load your data and create a vector store first.
    # For this example, we'll create a simple dummy store.
    
    load_dotenv()
    genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

    vector_store = SimpleVectorStore()
    
    dummy_chunks = [
        "What are the available food resources in zip code 17104?",
        "How can I apply for SNAP benefits in Pennsylvania?",
        "Where can I find a homeless shelter?",
        "What are the procedures for a flash flood warning?",
        "I need help with my utility bills in Harrisburg.",
    ]
    
    for chunk in dummy_chunks:
        emb = create_embeddings(chunk)
        vector_store.add_item(chunk, emb)

    # --- Run an example search ---
    example_query = "Where can I get food assistance in Harrisburg, PA?"
    
    print("--- Standard Search ---")
    standard_results = transformed_search(example_query, vector_store, "none", top_k=2)
    for res in standard_results:
        print(f"Text: {res['text']} (Similarity: {res['similarity']:.4f})")

    print("\n--- Rewrite Search ---")
    rewrite_results = transformed_search(example_query, vector_store, "rewrite", top_k=2)
    for res in rewrite_results:
        print(f"Text: {res['text']} (Similarity: {res['similarity']:.4f})")

    print("\n--- Step-back Search ---")
    step_back_results = transformed_search(example_query, vector_store, "step_back", top_k=2)
    for res in step_back_results:
        print(f"Text: {res['text']} (Similarity: {res['similarity']:.4f})")

    print("\n--- Decompose Search ---")
    decompose_results = transformed_search(example_query, vector_store, "decompose", top_k=2)
    for res in decompose_results:
        print(f"Text: {res['text']} (Similarity: {res['similarity']:.4f})")

--- Standard Search ---
Transformation type: none
Original query: Where can I get food assistance in Harrisburg, PA?
Text: How can I apply for SNAP benefits in Pennsylvania? (Similarity: 0.8337)
Text: I need help with my utility bills in Harrisburg. (Similarity: 0.8133)

--- Rewrite Search ---
Transformation type: rewrite
Original query: Where can I get food assistance in Harrisburg, PA?
Rewritten query: Food assistance Harrisburg PA
Text: I need help with my utility bills in Harrisburg. (Similarity: 0.7999)
Text: How can I apply for SNAP benefits in Pennsylvania? (Similarity: 0.7880)

--- Step-back Search ---
Transformation type: step_back
Original query: Where can I get food assistance in Harrisburg, PA?
Step-back query: What are your specific needs and circumstances that make you ask about food assistance?
Text: How can I apply for SNAP benefits in Pennsylvania? (Similarity: 0.7096)
Text: What are the available food resources in zip code 17104? (Similarity: 0.7024)

--- Decompose Se
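
The de-duplication step in the `decompose` branch can be looked at in isolation. The sketch below uses a hypothetical helper, `merge_sub_query_results` (the name is not from this notebook), and assumes each search result is a dict with `text` and `similarity` keys, which is how the results are accessed in `transformed_search` above.

```python
def merge_sub_query_results(result_lists, top_k=3):
    """Merge per-sub-query result lists, keeping the best score for each text."""
    best_by_text = {}
    for results in result_lists:
        for result in results:
            text = result["text"]
            current = best_by_text.get(text)
            # Keep only the highest-scoring copy of each chunk
            if current is None or result["similarity"] > current["similarity"]:
                best_by_text[text] = result
    # Rank the unique chunks by similarity and keep the top_k
    ranked = sorted(best_by_text.values(), key=lambda r: r["similarity"], reverse=True)
    return ranked[:top_k]


# Toy data: chunk "A" is returned by two different sub-queries.
hits_1 = [{"text": "A", "similarity": 0.70}, {"text": "B", "similarity": 0.65}]
hits_2 = [{"text": "A", "similarity": 0.82}, {"text": "C", "similarity": 0.60}]
merged = merge_sub_query_results([hits_1, hits_2], top_k=2)
print([(r["text"], r["similarity"]) for r in merged])
```

Note that this compares similarity scores obtained from different sub-queries directly, which is a simplification: a score of 0.8 against one sub-query is not necessarily more relevant to the original question than 0.7 against another.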

## Generating a Response with Transformed Queries

In [5]:
import google.generativeai as genai



def generate_response(query, context, model="gemini-2.0-flash"):
    """
    Generates a response based on the query and retrieved context using a Gemini model.

    Args:
        query (str): User query
        context (str): Retrieved context
        model (str): The model to use for response generation. Defaults to "gemini-2.0-flash".

    Returns:
        str: Generated response
    """
    # For Gemini, it's often best to combine the system prompt with the user's prompt
    # to guide the model's behavior, as a dedicated 'system' role isn't universally
    # supported in the same way as in the OpenAI Chat Completions API.

    # Combine the system prompt with the user's query and context.
    prompt = f"""
    You are a helpful AI assistant. Answer the user's question based only on the provided context. If you cannot find the answer in the context, state that you don't have enough information.

    Context:
    {context}

    Question: {query}

    Please provide a comprehensive answer based only on the context above.
    """
    
    # Initialize the Gemini GenerativeModel
    model_instance = genai.GenerativeModel(model)

    # Generate the response using the specified model.
    # We pass the combined prompt and set the generation temperature.
    try:
        response = model_instance.generate_content(
            prompt,
            generation_config=genai.GenerationConfig(
                temperature=0.1  # A low temperature for more consistent (not strictly deterministic) output
            )
        )
        
        # Access the generated text from the response object
        return response.text.strip()
    
    except Exception as e:
        # Handle potential errors, such as a model not being able to generate a response
        return f"An error occurred while generating the response: {e}"

# Example Usage (assuming you have context and query strings):
# context_text = "This is a document about the company's new policy on remote work. Remote work is allowed for all employees, provided they have a stable internet connection and get approval from their manager."
# user_query = "What is the policy on remote work?"
#
# response = generate_response(user_query, context_text)
# print(response)

## Running the Complete RAG Pipeline with Query Transformations

In [6]:
import os

def rag_with_query_transformation_for_folder(folder_path, query, transformation_type=None):
    """
    Run complete RAG pipeline with optional query transformation on all PDFs in a folder.

    Args:
        folder_path (str): Path to the folder containing PDF documents
        query (str): User query
        transformation_type (str): Type of transformation (None, 'rewrite', 'step_back', or 'decompose')

    Returns:
        List[Dict]: Results per PDF, each including filename, query, transformation, context, and response
    """
    results = []
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.pdf'):
                pdf_path = os.path.join(root, file)
                print(f"Processing: {pdf_path}")

                # Process the document to create a vector store
                vector_store = process_document(pdf_path)

                # Apply query transformation and search
                if transformation_type:
                    search_results = transformed_search(query, vector_store, transformation_type)
                else:
                    query_embedding = create_embeddings(query)
                    search_results = vector_store.similarity_search(query_embedding, k=3)

                # Combine context from search results
                context = "\n\n".join([f"PASSAGE {i+1}:\n{result['text']}" for i, result in enumerate(search_results)])

                # Generate response based on the query and combined context
                response = generate_response(query, context)

                # Store results for this PDF
                results.append({
                    "pdf_file": file,
                    "original_query": query,
                    "transformation_type": transformation_type,
                    "context": context,
                    "response": response
                })
    return results
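
The context string passed to `generate_response` is built by labelling each retrieved chunk as `PASSAGE n`. A minimal sketch of that step on its own, using a hypothetical helper name `format_context` and assuming each search result is a dict with a `text` key:

```python
def format_context(search_results):
    """Join retrieved chunks into one string, labelling each as PASSAGE n."""
    return "\n\n".join(
        f"PASSAGE {i}:\n{result['text']}"
        for i, result in enumerate(search_results, start=1)
    )


sample = [{"text": "First chunk."}, {"text": "Second chunk."}]
print(format_context(sample))
```

With an empty result list the context is an empty string, so the prompt in `generate_response` is the only thing telling the model to say it lacks information.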



In [7]:
import os

def rag_with_query_transformation_folder(folder_path, query, transformation_type=None):
    """
    Run complete RAG pipeline with optional query transformation using Gemini-compatible functions on all PDFs in a folder.

    Args:
        folder_path (str): Path to folder containing PDF documents
        query (str): User query
        transformation_type (str): Type of transformation (None, 'rewrite', 'step_back', or 'decompose')

    Returns:
        List[Dict]: Results per PDF, each including file, query, context, and response (or error)
    """
    results = []
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.pdf'):
                pdf_path = os.path.join(root, file)
                print(f"\n=== Processing: {pdf_path} ===")
                try:
                    # Process the document to create a vector store
                    print("Starting RAG pipeline...")
                    vector_store = process_document(pdf_path)
                except Exception as e:
                    results.append({"pdf_file": file, "error": f"Failed to process document: {e}"})
                    continue

                if vector_store is None:
                    results.append({"pdf_file": file, "error": "Vector store could not be created."})
                    continue

                # Apply query transformation and search
                if transformation_type in ('rewrite', 'step_back', 'decompose'):
                    print(f"Applying '{transformation_type}' transformation to the query.")
                    try:
                        search_results = transformed_search(query, vector_store, transformation_type)
                    except Exception as e:
                        results.append({"pdf_file": file, "error": f"Transformed search failed: {e}"})
                        continue
                else:
                    print("Performing a standard search without query transformation.")
                    try:
                        query_embedding = create_embeddings(query)
                        search_results = vector_store.similarity_search(query_embedding, k=3)
                    except Exception as e:
                        results.append({"pdf_file": file, "error": f"Standard search failed: {e}"})
                        continue

                # Combine context from search results
                context = "\n\n".join([f"PASSAGE {i+1}:\n{result['text']}" for i, result in enumerate(search_results)])
                
                # Generate response based on the query and combined context
                try:
                    print("Generating final response...")
                    response = generate_response(query, context)
                except Exception as e:
                    results.append({"pdf_file": file, "error": f"Response generation failed: {e}"})
                    continue

                print("Pipeline complete.")
                results.append({
                    "pdf_file": file,
                    "original_query": query,
                    "transformation_type": transformation_type,
                    "context": context,
                    "response": response
                })
    return results
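
The evaluation cells that follow call `rag_with_query_transformation(pdf_path, query, transformation_type)` and expect a single dict with either a `response` or an `error` key, whereas the two functions above take a folder and return a list. If no single-PDF version is defined earlier in the notebook, one way to get that shape is sketched below. This is an assumption about what such a function might look like, not the original implementation; `make_single_pdf_pipeline` and the `fake_*` stand-ins are hypothetical names used only so the sketch runs on its own.

```python
def make_single_pdf_pipeline(process_document, search, generate_response):
    """Build a rag_with_query_transformation(pdf_path, query, transformation_type) callable."""

    def rag_with_query_transformation(pdf_path, query, transformation_type=None):
        try:
            vector_store = process_document(pdf_path)
            search_results = search(query, vector_store, transformation_type)
        except Exception as e:
            # Report failures as data so an evaluation loop can continue
            return {"error": f"Retrieval failed for {pdf_path}: {e}"}

        context = "\n\n".join(
            f"PASSAGE {i}:\n{r['text']}" for i, r in enumerate(search_results, start=1)
        )
        return {
            "original_query": query,
            "transformation_type": transformation_type,
            "context": context,
            "response": generate_response(query, context),
        }

    return rag_with_query_transformation


# Stand-ins so the sketch runs on its own; in the notebook, pass the real
# process_document, transformed_search and generate_response instead.
def fake_process_document(pdf_path):
    if not pdf_path.endswith(".pdf"):
        raise ValueError("not a PDF")
    return ["chunk about emergency kits"]


def fake_search(query, vector_store, transformation_type):
    return [{"text": chunk, "similarity": 1.0} for chunk in vector_store]


def fake_generate_response(query, context):
    return f"Answer drawn from {context.count('PASSAGE')} passage(s)."


rag_with_query_transformation = make_single_pdf_pipeline(
    fake_process_document, fake_search, fake_generate_response
)
ok = rag_with_query_transformation("guide.pdf", "What is in the kit?", "rewrite")
bad = rag_with_query_transformation("notes.txt", "What is in the kit?")
print(ok)
print(bad)
```

Passing `transformed_search` as `search` also covers the untransformed case, because its final `else` branch performs a regular search when the transformation type is not recognised.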



## Evaluating Transformation Techniques

In [8]:
import google.generativeai as genai

# Assume genai.configure(api_key="YOUR_API_KEY") has been called.

def compare_responses(results, reference_answer, model="gemini-2.0-flash"):
    """
    Compare responses from different query transformation techniques using a Gemini model.

    Args:
        results (Dict): Results from different transformation techniques
        reference_answer (str): Reference answer for comparison
        model (str): The model to use for evaluation. Defaults to "gemini-2.0-flash".
    """
    # Define the system prompt to guide the AI assistant's behavior
    system_prompt = """You are an expert evaluator of RAG systems. 
    Your task is to compare different responses generated using various query transformation techniques 
    and determine which technique produced the best response compared to the reference answer."""

    # Prepare the comparison text with the reference answer and responses from each technique
    comparison_text = f"""Reference Answer: {reference_answer}\n\n"""
    
    for technique, result in results.items():
        # Ensure the 'response' key exists and is a string
        response_content = result.get('response', 'No response found.')
        comparison_text += f"{technique.capitalize()} Query Response:\n{response_content}\n\n"
    
    # Define the user prompt with the comparison text
    user_prompt = f"""
    {comparison_text}
    
    Compare the responses generated by different query transformation techniques to the reference answer.
    
    For each technique (e.g., original, rewrite, step_back, decompose):
    1. Score the response from 1-10 based on accuracy, completeness, and relevance
    2. Identify strengths and weaknesses
    
    Then rank the techniques from best to worst and explain which technique performed best overall and why.
    """
    
    # Combine the system prompt and user prompt for the Gemini API call
    full_prompt = f"{system_prompt}\n\n{user_prompt}"

    # Generate the evaluation response using the specified model
    try:
        model_instance = genai.GenerativeModel(model)
        response = model_instance.generate_content(
            full_prompt,
            generation_config=genai.GenerationConfig(
                temperature=0.1  # Low temperature for more consistent (not strictly deterministic) output
            )
        )
        
        # Print the evaluation results
        print("\n===== EVALUATION RESULTS =====")
        print(response.text.strip())
        print("=============================")

    except Exception as e:
        print(f"\nAn error occurred during evaluation: {e}")
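
`compare_responses` expects `results` to be a dict keyed by technique name, where each value is itself a dict that may contain a `response` key. The sketch below isolates how the comparison text is assembled from that structure; `build_comparison_text` is a hypothetical helper name, and the sample data is made up for illustration.

```python
def build_comparison_text(results, reference_answer):
    """Assemble the text block that is sent to the evaluator model."""
    parts = [f"Reference Answer: {reference_answer}"]
    for technique, result in results.items():
        # Entries that only carry an "error" key fall back to a placeholder
        response = result.get("response", "No response found.")
        parts.append(f"{technique.capitalize()} Query Response:\n{response}")
    return "\n\n".join(parts) + "\n\n"


sample_results = {
    "original": {"response": "Food and water."},
    "step_back": {"error": "Transformed search failed"},
}
comparison = build_comparison_text(sample_results, "Food, water, and first-aid supplies.")
print(comparison)
```

A technique whose pipeline run failed still appears in the comparison with the placeholder text, so the evaluator model will score it rather than skip it.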



In [9]:
import os

def evaluate_transformations_folder(folder_path, query, reference_answer=None):
    """
    Evaluate different transformation techniques for the same query on all PDFs in a folder.

    Args:
        folder_path (str): Path to folder containing PDF documents
        query (str): Query to evaluate
        reference_answer (str): Optional reference answer for comparison

    Returns:
        Dict: Results per PDF, each with results for each transformation type
    """
    # Transformation types to evaluate
    transformation_types = [None, "rewrite", "step_back", "decompose"]
    all_results = {}

    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.lower().endswith('.pdf'):
                pdf_path = os.path.join(root, file)
                print(f"\n\n===== Evaluating PDF: {pdf_path} =====")
                results = {}

                for transformation_type in transformation_types:
                    type_name = transformation_type if transformation_type else "original"
                    print(f"\n--- Running RAG with {type_name} query ---")

                    # Run RAG for the transformation type
                    result = rag_with_query_transformation(pdf_path, query, transformation_type)
                    results[type_name] = result

                    # Print response or error
                    if "response" in result:
                        print(f"Response with {type_name} query:\n{result['response'][:400]}")  # Preview
                    else:
                        print(f"Error with {type_name} query: {result.get('error')}")
                    print("-" * 50)
                
                # Optionally compare responses if a reference is given
                if reference_answer:
                    compare_responses(results, reference_answer)
                
                # Save all results for this PDF
                all_results[file] = results

    return all_results



## Evaluation of Query Transformations

In [10]:
def evaluate_transformations(pdf_path, query, reference_answer=None):
    """
    Evaluate different transformation techniques for the same query using Gemini-compatible functions.

    Args:
        pdf_path (str): Path to PDF document
        query (str): Query to evaluate
        reference_answer (str): Optional reference answer for comparison

    Returns:
        Dict: Evaluation results
    """
    # Define the transformation techniques to evaluate
    transformation_types = [None, "rewrite", "step_back", "decompose"]
    results = {}
    
    # Run RAG with each transformation technique
    for transformation_type in transformation_types:
        type_name = transformation_type if transformation_type else "original"
        print(f"\n===== Running RAG with {type_name} query =====")
        
        # Get the result for the current transformation type
        result = rag_with_query_transformation(pdf_path, query, transformation_type)
        results[type_name] = result
        
        # Check for an error key in the result before trying to access 'response'
        if "error" in result:
            print(f"Error during RAG with {type_name} query: {result['error']}")
        else:
            # Print the response for the current transformation type
            print(f"Response with {type_name} query:")
            print(result["response"])

        print("=" * 50)
    
    # Compare results if a reference answer is provided
    if reference_answer:
        compare_responses(results, reference_answer)
    
    return results


In [11]:
import json

def print_evaluation_results(results):
    """
    Prints the evaluation results from different RAG query transformation techniques.

    Args:
        results (Dict): The dictionary containing evaluation results for each technique.
    """
    print("\n\n===== FINAL EVALUATION REPORT =====")
    
    # Check if any results were produced
    if not results:
        print("No evaluation results to display.")
        return

    # Print a summary for each transformation type
    for technique, result in results.items():
        print(f"\n--- {technique.upper()} QUERY RESULTS ---")
        
        # Check for and handle errors
        if "error" in result:
            print(f"Error: {result['error']}")
            continue
        
        print(f"  Original Query: {result.get('original_query', 'N/A')}")
        print(f"  Transformation Type: {result.get('transformation_type', 'N/A')}")
        print("\n  Generated Response:")
        print("  " + result.get('response', 'No response generated.').replace('\n', '\n  '))
        
        # You may also want to print the context to understand why the response was generated
        print("\n  Retrieved Context:")
        print("  " + result.get('context', 'No context retrieved.').replace('\n', '\n  '))

    print("\n\n===== END OF REPORT =====")

# --- Example Usage (assuming 'evaluation_results' is populated) ---
# Example dictionary structure (replace with your actual results)
evaluation_results = {
    'original': {
        'original_query': 'What is in the emergency kit?',
        'transformation_type': None,
        'context': 'This is a test context about the emergency kit.',
        'response': 'The emergency kit contains food, water, and first-aid supplies.'
    },
    'rewrite': {
        'original_query': 'What is in the emergency kit?',
        'transformation_type': 'rewrite',
        'context': 'This is a test context about the emergency kit.',
        'response': 'The emergency kit contains food, water, and first-aid supplies.'
    }
}



# Call the print function
print_evaluation_results(evaluation_results)



===== FINAL EVALUATION REPORT =====

--- ORIGINAL QUERY RESULTS ---
  Original Query: What is in the emergency kit?
  Transformation Type: None

  Generated Response:
  The emergency kit contains food, water, and first-aid supplies.

  Retrieved Context:
  This is a test context about the emergency kit.

--- REWRITE QUERY RESULTS ---
  Original Query: What is in the emergency kit?
  Transformation Type: rewrite

  Generated Response:
  The emergency kit contains food, water, and first-aid supplies.

  Retrieved Context:
  This is a test context about the emergency kit.


===== END OF REPORT =====
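
The cell above imports `json` but does not use it. One plausible use is to persist the results dictionary so that runs can be compared later. The sketch below assumes the results contain only JSON-serializable values (strings, `None`, nested dicts), as in the example dictionary; `save_evaluation_results` and `load_evaluation_results` are hypothetical helper names.

```python
import json
import os
import tempfile


def save_evaluation_results(results, path):
    """Write the results dictionary to a JSON file and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    return path


def load_evaluation_results(path):
    """Read a results dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


sample = {
    "original": {
        "original_query": "What is in the emergency kit?",
        "transformation_type": None,
        "response": "The emergency kit contains food, water, and first-aid supplies.",
    }
}

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_evaluation_results(sample, os.path.join(tmp_dir, "results.json"))
    loaded = load_evaluation_results(saved_path)

print(loaded == sample)
```

`None` is written as JSON `null` and read back as `None`, so the `transformation_type` of the untransformed run survives the round trip.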
