# Self-RAG: A Dynamic Approach to RAG

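In this notebook, I implement Self-RAG, a RAG variant that decides at run time whether to retrieve and how to use what it retrieves. A traditional RAG pipeline always retrieves and always passes the results to the generator. Self-RAG instead adds reflection points across retrieval and generation: it checks whether each retrieved document is relevant, whether each response is grounded in its context, and how useful the response is. The aim is responses that are better grounded and more reliable.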

## Key Components of Self-RAG

1. **Retrieval Decision**: Determines if retrieval is even necessary for a given query
2. **Document Retrieval**: Fetches potentially relevant documents when needed  
3. **Relevance Evaluation**: Assesses how relevant each retrieved document is
4. **Response Generation**: Creates responses based on relevant contexts
5. **Support Assessment**: Evaluates if responses are properly grounded in the context
6. **Utility Evaluation**: Rates the overall usefulness of generated responses

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
import fitz # PyMuPDF
import os
import numpy as np
import json
import re
import google.generativeai as genai

In [2]:

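from dotenv import load_dotenv

# Load GOOGLE_API_KEY from a local .env file (if present) and configure Gemini
# before any cell calls the API
load_dotenv()
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))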


## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [5]:
def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the entire PDF.
    """
    # Open the PDF file
    doc = fitz.open(pdf_path)
    all_text = []

    # Iterate through each page in the PDF
    for page in doc:
        all_text.append(page.get_text("text"))

    doc.close()
    return "\n".join(all_text)

# Example usage:
pdf_file = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf"
text = extract_text_from_pdf(pdf_file)
print(text) 

19
Defining and Measuring Homelessness
Volker Busch-Geertsema
GISS, Germany
>> Abstract_ Substantial progress has been made at EU level on defining home-
lessness. The European Typology on Homelessness and Housing Exclusion 
(ETHOS) is widely accepted in almost all European countries (and beyond) as 
a useful conceptual framework and almost everywhere definitions at national 
level (though often not identical with ETHOS) are discussed in relation to this 
typology. The development and some of the remaining controversial issues 
concerning ETHOS and a reduced version of it are discussed in this chapter. 
Furthermore essential reasons and different approaches to measure home-
lessness are presented. It is argued that a single number will not be enough 
to understand homelessness and monitor progress in tackling it. More 
research and more work to improve information on homelessness at national 
levels will be needed before we can achieve comparable numbers at EU level.
>> Keywords_ Data,

## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [6]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
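    step = n - overlap
    if step <= 0:
        # A zero or negative step would break the range() call below
        raise ValueError("overlap must be smaller than n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), step):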
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
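
To make the overlap concrete, here is a small sketch that applies the same sliding-window logic to a toy string. The helper name `sliding_chunks` and the sample values are illustrative assumptions, not part of the notebook. The helper also rejects `overlap >= n`, which would otherwise give `range` a zero or negative step.

```python
from typing import List


def sliding_chunks(text: str, n: int, overlap: int) -> List[str]:
    """Illustrative copy of the chunking logic with an explicit step check."""
    step = n - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than n")
    # Each window starts `step` characters after the previous one
    return [text[i:i + n] for i in range(0, len(text), step)]


sample = "abcdefghij"  # 10 characters
chunks = sliding_chunks(sample, n=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk starts with the last two characters of the previous one, and the final chunk can be shorter than `n`.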

## Simple Vector Store Implementation
We'll create a basic vector store to manage document chunks and their embeddings.

In [7]:
class SimpleVectorStore:
    """
    A simple vector store implementation using NumPy.
    """
    def __init__(self):
        """
        Initialize the vector store.
        """
        self.vectors = []  # List to store embedding vectors
        self.texts = []  # List to store original texts
        self.metadata = []  # List to store metadata for each text
    
    def add_item(self, text, embedding, metadata=None):
        """
        Add an item to the vector store.

        Args:
        text (str): The original text.
        embedding (List[float]): The embedding vector.
        metadata (dict, optional): Additional metadata.
        """
        self.vectors.append(np.array(embedding))  # Convert embedding to numpy array and add to vectors list
        self.texts.append(text)  # Add the original text to texts list
        self.metadata.append(metadata or {})  # Add metadata to metadata list, default to empty dict if None
    
    def similarity_search(self, query_embedding, k=5, filter_func=None):
        """
        Find the most similar items to a query embedding.

        Args:
        query_embedding (List[float]): Query embedding vector.
        k (int): Number of results to return.
        filter_func (callable, optional): Function to filter results.

        Returns:
        List[Dict]: Top k most similar items with their texts and metadata.
        """
        if not self.vectors:
            return []  # Return empty list if no vectors are stored
        
        # Convert query embedding to numpy array
        query_vector = np.array(query_embedding)
        
        # Calculate similarities using cosine similarity
        similarities = []
        for i, vector in enumerate(self.vectors):
            # Apply filter if provided
            if filter_func and not filter_func(self.metadata[i]):
                continue
                
            # Calculate cosine similarity
            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
            similarities.append((i, similarity))  # Append index and similarity score
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top k results
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],  # Add the text
                "metadata": self.metadata[idx],  # Add the metadata
                "similarity": score  # Add the similarity score
            })
        
        return results  # Return the list of top k results
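
The ranking step in `similarity_search` can be checked by hand on tiny vectors. The sketch below is a pure-Python illustration of the same cosine formula, so it runs without NumPy. The helper `cosine` and the toy vectors are made up for this example.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
docs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort document names by similarity to the query, highest first
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

Only direction matters: `[2.0, 0.0]` scores 1.0 against the query even though it is twice as long. A zero vector would cause a division by zero, both here and in the store's formula.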

## Creating Embeddings

In [9]:


from typing import List, Union

# --- Define the create_embeddings function for Gemini ---
def create_embeddings(text: Union[str, List[str]], model: str = "models/embedding-001") -> Union[List[float], List[List[float]]]:
    """
    Creates embeddings for the given text or list of texts using the Gemini API.

    Args:
    text (str or List[str]): The input text(s) for which embeddings are to be created.
    model (str): The model to be used for creating embeddings. Default is "models/embedding-001".

    Returns:
    List[float] or List[List[float]]: The embedding vector(s).
    """
    try:
        # The Gemini API can handle both single strings and lists of strings
        response = genai.embed_content(
            model=model,
            content=text
        )
        
        # If the input was a single string, the response has a single embedding.
        # If the input was a list, the response is a list of embeddings.
        return response['embedding']

    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return []

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Example 1: Create an embedding for a single string
    single_text = "Homelessness is a complex social issue."
    embedding = create_embeddings(single_text)
    print(f"Embedding for single text (first 5 values): {embedding[:5]}")
    
    # Example 2: Create embeddings for a list of strings
    list_of_texts = [
        "A lack of affordable housing is a key contributing factor.",
        "Social factors also play a role in homelessness."
    ]
    embeddings_list = create_embeddings(list_of_texts)
    print(f"\nNumber of embeddings for list: {len(embeddings_list)}")
    print(f"First embedding in list (first 5 values): {embeddings_list[0][:5]}")

Embedding for single text (first 5 values): [0.052571062, -0.03685706, -0.06520665, -0.04034025, 0.038206574]

Number of embeddings for list: 2
First embedding in list (first 5 values): [0.07521696, -0.034325134, -0.039195377, -0.008227663, 0.10222888]


## Document Processing Pipeline

In [10]:
def process_document(pdf_path, chunk_size=1000, chunk_overlap=200):
    """
    Process a document for Self-RAG.

    Args:
        pdf_path (str): Path to the PDF file.
        chunk_size (int): Size of each chunk in characters.
        chunk_overlap (int): Overlap between chunks in characters.

    Returns:
        SimpleVectorStore: A vector store containing document chunks and their embeddings.
    """
    # Extract text from the PDF file
    print("Extracting text from PDF...")
    extracted_text = extract_text_from_pdf(pdf_path)
    
    # Chunk the extracted text
    print("Chunking text...")
    chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
    print(f"Created {len(chunks)} text chunks")
    
    # Create embeddings for each chunk
    print("Creating embeddings for chunks...")
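    chunk_embeddings = create_embeddings(chunks)
    # create_embeddings returns [] on failure; stop here rather than build an empty store
    if len(chunk_embeddings) != len(chunks):
        raise RuntimeError("Embedding failed: expected one embedding per chunk")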
    
    # Initialize the vector store
    store = SimpleVectorStore()
    
    # Add each chunk and its embedding to the vector store
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)):
        store.add_item(
            text=chunk,
            embedding=embedding,
            metadata={"index": i, "source": pdf_path}
        )
    
    print(f"Added {len(chunks)} chunks to the vector store")
    return store

## Self-RAG Components
### 1. Retrieval Decision

In [11]:
import os
import google.generativeai as genai
from typing import List, Dict


# --- Define the retrieval decision function for Gemini ---
def determine_if_retrieval_needed(query: str, model: str = "gemini-1.5-flash") -> bool:
    """
    Determines if retrieval is necessary for the given query using Gemini.
    
    Args:
        query (str): User query
        model (str): The model to be used for the determination.
        
    Returns:
        bool: True if retrieval is needed, False otherwise
    """
    # System prompt to instruct the AI on how to determine if retrieval is necessary
    system_prompt = """You are an AI assistant that determines if retrieval is necessary to answer a query.
For factual questions, specific information requests, or questions about events, people, or concepts, answer "Yes".
For opinions, hypothetical scenarios, or simple queries with common knowledge, answer "No".
Answer with ONLY "Yes" or "No"."""

    # User prompt containing the query
    user_prompt = f"Query: {query}\n\nIs retrieval necessary to answer this query accurately?"
    
    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate the response
        response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0})
        
        # Extract the answer from the model's response and convert to lowercase
        answer = response.text.strip().lower()
        
        # Return True if the answer contains "yes", otherwise return False
        return "yes" in answer
    except Exception as e:
        print(f"An error occurred during retrieval assessment: {e}")
        return False # Return False on error

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    test_query_1 = "What is the capital of France?" # Factual question, should be "Yes"
    test_query_2 = "What do you think is the best color?" # Opinion-based, should be "No"
    
    print(f"Query 1: '{test_query_1}'")
    retrieval_needed_1 = determine_if_retrieval_needed(test_query_1)
    print(f"Is retrieval needed? {retrieval_needed_1}")
    
    print(f"\nQuery 2: '{test_query_2}'")
    retrieval_needed_2 = determine_if_retrieval_needed(test_query_2)
    print(f"Is retrieval needed? {retrieval_needed_2}")

Query 1: 'What is the capital of France?'
Is retrieval needed? True

Query 2: 'What do you think is the best color?'
Is retrieval needed? False


### 2. Relevance Evaluation

In [12]:


# --- Define the relevance evaluation function for Gemini ---
def evaluate_relevance(query: str, context: str, model: str = "gemini-1.5-flash") -> str:
    """
    Evaluates the relevance of a context to the query using Gemini.
    
    Args:
        query (str): User query
        context (str): Context text
        model (str): The model to be used for the evaluation.
        
    Returns:
        str: 'relevant' or 'irrelevant'
    """
    # System prompt to instruct the AI on how to determine document relevance
    system_prompt = """You are an AI assistant that determines if a document is relevant to a query.
Consider whether the document contains information that would be helpful in answering the query.
Answer with ONLY "Relevant" or "Irrelevant"."""

    # Truncate context if it is too long to avoid exceeding token limits
    max_context_length = 2000
    if len(context) > max_context_length:
        context = context[:max_context_length] + "... [truncated]"

    # User prompt containing the query and the document content
    user_prompt = f"""Query: {query}
Document content:
{context}

Is this document relevant to the query? Answer with ONLY "Relevant" or "Irrelevant".
"""
    
    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate the response
        response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0})
        
        # Extract the answer from the model's response
        answer = response.text.strip().lower()
        
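        # Check the negative label first: "irrelevant" contains "relevant",
        # so testing for "relevant" alone would match both labels
        if "irrelevant" in answer:
            return "irrelevant"
        return "relevant" if "relevant" in answer else "irrelevant"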
    except Exception as e:
        print(f"An error occurred during relevance evaluation: {e}")
        return "irrelevant" # Return irrelevant on error

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a query and context from a previous step
    relevant_query = "What is a major cause of homelessness?"
    relevant_context = "A key factor is the lack of affordable housing, which disproportionately affects low-income families and individuals. Social factors like family breakdown can also lead to homelessness."
    
    irrelevant_query = "Who was the first person to walk on the moon?"
    irrelevant_context = "Homelessness is a complex social problem with various contributing factors, including economic, social, and personal issues."
    
    # Evaluate a relevant context
    print(f"Query: '{relevant_query}'")
    relevance_1 = evaluate_relevance(relevant_query, relevant_context)
    print(f"Relevance: {relevance_1}")
    
    print("-" * 50)
    
    # Evaluate an irrelevant context
    print(f"Query: '{irrelevant_query}'")
    relevance_2 = evaluate_relevance(irrelevant_query, irrelevant_context)
    print(f"Relevance: {relevance_2}")

Query: 'What is a major cause of homelessness?'
Relevance: relevant
--------------------------------------------------
Query: 'Who was the first person to walk on the moon?'
Relevance: relevant
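
The second result deserves a closer look: the moon-landing query is labelled `relevant` even though the context is about homelessness. One plausible cause (an assumption, since the raw model reply is not shown) is label parsing. The string "irrelevant" contains "relevant", so a plain substring test for "relevant" matches both labels. A minimal sketch of a parser that avoids this, using a hypothetical helper `parse_relevance_label`:

```python
def parse_relevance_label(raw_answer: str) -> str:
    """Map a raw model reply to 'relevant' or 'irrelevant'.

    Hypothetical helper for illustration. The negative label is tested
    first because 'irrelevant' contains 'relevant' as a substring.
    """
    answer = raw_answer.strip().lower()
    if "irrelevant" in answer:
        return "irrelevant"
    if "relevant" in answer:
        return "relevant"
    return "irrelevant"  # Unrecognised replies are treated as not relevant


print("relevant" in "irrelevant")           # True: the substring pitfall
print(parse_relevance_label("Irrelevant"))  # irrelevant
print(parse_relevance_label("Relevant."))   # relevant
```

The sketch only covers the two labels the prompt asks for. A free-form reply such as "not relevant" would still be read as `relevant`.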


### 3. Support Assessment

In [13]:


# --- 2. The main support assessment function (revised for Gemini) ---
def assess_support(response: str, context: str, model: str = "gemini-1.5-flash") -> str:
    """
    Assesses how well a response is supported by the context using Gemini.

    Args:
        response (str): Generated response
        context (str): Context text
        model (str): Model to use for the assessment.

    Returns:
        str: 'fully supported', 'partially supported', or 'no support'
    """
    # System prompt to instruct the AI on how to evaluate support
    system_prompt = """You are an AI assistant that determines if a response is supported by the given context.
Evaluate if the facts, claims, and information in the response are backed by the context.
Answer with ONLY one of these three options:
- "Fully supported": All information in the response is directly supported by the context.
- "Partially supported": Some information in the response is supported by the context, but some is not.
- "No support": The response contains significant information not found in or contradicting the context.
"""

    # Truncate context if it is too long to avoid exceeding token limits
    max_context_length = 2000
    if len(context) > max_context_length:
        context = context[:max_context_length] + "... [truncated]"

    # User prompt containing the context and the response to be evaluated
    user_prompt = f"""Context:
{context}

Response:
{response}

How well is this response supported by the context? Answer with ONLY "Fully supported", "Partially supported", or "No support".
"""
    
    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate the response
        api_response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0})
        
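        # Extract the answer and normalize it to one of the three expected labels,
        # so trailing punctuation or extra words do not break later lookups
        answer = api_response.text.strip().lower()
        for label in ("fully supported", "partially supported", "no support"):
            if label in answer:
                return label

        return answer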
    except Exception as e:
        print(f"An error occurred during support assessment: {e}")
        return "error"

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a response and context
    full_support_context = "The primary causes of homelessness are a lack of affordable housing and economic instability, such as job loss."
    full_support_response = "A lack of affordable housing and job loss are the main causes of homelessness."
    
    partial_support_context = "A lack of affordable housing is a major cause of homelessness."
    partial_support_response = "Homelessness is caused by a lack of affordable housing and also by mental health issues."
    
    no_support_context = "The sun is the star at the center of the solar system."
    no_support_response = "A lack of affordable housing is a key contributing factor to homelessness."
    
    print("Assessing a fully supported response...")
    assessment_1 = assess_support(full_support_response, full_support_context)
    print(f"Assessment: {assessment_1}")
    
    print("\nAssessing a partially supported response...")
    assessment_2 = assess_support(partial_support_response, partial_support_context)
    print(f"Assessment: {assessment_2}")
    
    print("\nAssessing a response with no support...")
    assessment_3 = assess_support(no_support_response, no_support_context)
    print(f"Assessment: {assessment_3}")

Assessing a fully supported response...
Assessment: fully supported

Assessing a partially supported response...
Assessment: partially supported

Assessing a response with no support...
Assessment: no support


### 4. Utility Evaluation

In [14]:

# --- 2. The main utility rating function (revised for Gemini) ---
def rate_utility(query: str, response: str, model: str = "gemini-1.5-flash") -> int:
    """
    Rates the utility of a response for the query using Gemini.
    
    Args:
        query (str): User query
        response (str): Generated response
        model (str): Model to use for the rating.
        
    Returns:
        int: Utility rating from 1 to 5
    """
    # System prompt to instruct the AI on how to rate the utility of the response
    system_prompt = """You are an AI assistant that rates the utility of a response to a query.
    Consider how well the response answers the query, its completeness, correctness, and helpfulness.
    Rate the utility on a scale from 1 to 5, where:
    - 1: Not useful at all
    - 2: Slightly useful
    - 3: Moderately useful
    - 4: Very useful
    - 5: Exceptionally useful
    Answer with ONLY a single number from 1 to 5."""

    # User prompt containing the query and the response to be rated
    user_prompt = f"""Query: {query}
    Response:
    {response}

    Rate the utility of this response on a scale from 1 to 5:"""
    
    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate the utility rating
        api_response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0})
        
        # Extract the rating from the model's response
        rating_text = api_response.text.strip()
        
        # Use regex to extract the number from the rating
        rating_match = re.search(r'[1-5]', rating_text)
        if rating_match:
            return int(rating_match.group())
        
        return 3 # Default to a middle rating if parsing fails
    except Exception as e:
        print(f"An error occurred during utility rating: {e}")
        return 3 # Default to a middle rating on error

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a query and a generated response
    sample_query = "What is the capital of France?"
    sample_response_1 = "The capital of France is Paris."
    sample_response_2 = "The capital of France is a city in Europe, but I don't know the name."
    
    print("Rating first response...")
    rating_1 = rate_utility(sample_query, sample_response_1)
    print(f"Response 1 rating: {rating_1}")
    
    print("\nRating second response...")
    rating_2 = rate_utility(sample_query, sample_response_2)
    print(f"Response 2 rating: {rating_2}")

Rating first response...
Response 1 rating: 5

Rating second response...
Response 2 rating: 1
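
The rating parser above accepts the first character in the range 1 to 5, so a reply such as `10` would be read as `1`. A slightly stricter sketch, assuming a hypothetical helper `parse_rating`, requires the digit to stand alone. It returns `None` instead of a default, so the caller can tell a real rating from a parse failure.

```python
import re
from typing import Optional


def parse_rating(raw_answer: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the reply, or None."""
    # \b on both sides rejects digits that are part of a longer number
    match = re.search(r"\b[1-5]\b", raw_answer)
    return int(match.group()) if match else None


print(parse_rating("4"))            # 4
print(parse_rating("Rating: 5/5"))  # 5
print(parse_rating("10"))           # None
```

Whether to fall back to a middle rating, as the notebook does, or to skip the response when parsing fails is a design choice. Returning `None` just makes that choice explicit at the call site.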


## Response Generation

In [16]:
import os
import google.generativeai as genai
from typing import Optional

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the response generator for Gemini ---
def generate_response(query: str, context: Optional[str] = None, model: str = "gemini-1.5-flash") -> str:
    """
    Generates a response based on the query and optional context using Gemini.

    Args:
        query (str): User query
        context (str, optional): Context text
        model (str): LLM model to use

    Returns:
        str: Generated response
    """
    # System prompt to instruct the AI on how to generate a helpful response
    system_prompt = "You are a helpful AI assistant. Provide a clear, accurate, and informative response to the query."
    
    # Create the user prompt based on whether context is provided
    if context:
        user_prompt = f"""Context:
{context}

Query: {query}

Please answer the query based on the provided context."""
    else:
        user_prompt = f"""Query: {query}

Please answer the query to the best of your ability."""
    
    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate the response using the specified model
        response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0.5})
        
        # Return the generated response content
        return response.text.strip()
    except Exception as e:
        print(f"An error occurred during response generation: {e}")
        return "I could not generate a response due to an error."

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Example 1: RAG-style generation with context
    rag_query = "What are the main causes of homelessness?"
    rag_context = "Homelessness is a complex social problem. A key factor is the lack of affordable housing, which disproportionately affects low-income families and individuals."
    
    print("Generating RAG-style response with context...")
    rag_response = generate_response(rag_query, rag_context)
    print("RAG Response:", rag_response)
    
    print("-" * 50)
    
    # Example 2: Standard chat generation without context
    standard_query = "What is the capital of France?"
    
    print("Generating standard chat response without context...")
    standard_response = generate_response(standard_query)
    print("Standard Response:", standard_response)

Generating RAG-style response with context...
RAG Response: Based on the provided context, a main cause of homelessness is the lack of affordable housing.  This lack of affordable housing disproportionately impacts low-income families and individuals, making them more vulnerable to homelessness.
--------------------------------------------------
Generating standard chat response without context...
Standard Response: The capital of France is Paris.


## Complete Self-RAG Implementation

In [17]:
def self_rag(query, vector_store, top_k=3):
    """
    Implements the complete Self-RAG pipeline.
    
    Args:
        query (str): User query
        vector_store (SimpleVectorStore): Vector store containing document chunks
        top_k (int): Number of documents to retrieve initially
        
    Returns:
        dict: Results including query, response, and metrics from the Self-RAG process
    """
    print(f"\n=== Starting Self-RAG for query: {query} ===\n")
    
    # Step 1: Determine if retrieval is necessary
    print("Step 1: Determining if retrieval is necessary...")
    retrieval_needed = determine_if_retrieval_needed(query)
    print(f"Retrieval needed: {retrieval_needed}")
    
    # Initialize metrics to track the Self-RAG process
    metrics = {
        "retrieval_needed": retrieval_needed,
        "documents_retrieved": 0,
        "relevant_documents": 0,
        "response_support_ratings": [],
        "utility_ratings": []
    }
    
    best_response = None
    best_score = -1
    
    if retrieval_needed:
        # Step 2: Retrieve documents
        print("\nStep 2: Retrieving relevant documents...")
        query_embedding = create_embeddings(query)
        results = vector_store.similarity_search(query_embedding, k=top_k)
        metrics["documents_retrieved"] = len(results)
        print(f"Retrieved {len(results)} documents")
        
        # Step 3: Evaluate relevance of each document
        print("\nStep 3: Evaluating document relevance...")
        relevant_contexts = []
        
        for i, result in enumerate(results):
            context = result["text"]
            relevance = evaluate_relevance(query, context)
            print(f"Document {i+1} relevance: {relevance}")
            
            if relevance == "relevant":
                relevant_contexts.append(context)
        
        metrics["relevant_documents"] = len(relevant_contexts)
        print(f"Found {len(relevant_contexts)} relevant documents")
        
        if relevant_contexts:
            # Step 4: Process each relevant context
            print("\nStep 4: Processing relevant contexts...")
            for i, context in enumerate(relevant_contexts):
                print(f"\nProcessing context {i+1}/{len(relevant_contexts)}...")
                
                # Generate response based on the context
                print("Generating response...")
                response = generate_response(query, context)
                
                # Assess how well the response is supported by the context
                print("Assessing support...")
                support_rating = assess_support(response, context)
                print(f"Support rating: {support_rating}")
                metrics["response_support_ratings"].append(support_rating)
                
                # Rate the utility of the response
                print("Rating utility...")
                utility_rating = rate_utility(query, response)
                print(f"Utility rating: {utility_rating}/5")
                metrics["utility_ratings"].append(utility_rating)
                
                # Calculate overall score (higher for better support and utility)
                support_score = {
                    "fully supported": 3, 
                    "partially supported": 1, 
                    "no support": 0
                }.get(support_rating, 0)
                
                overall_score = support_score * 5 + utility_rating
                print(f"Overall score: {overall_score}")
                
                # Keep track of the best response
                if overall_score > best_score:
                    best_response = response
                    best_score = overall_score
                    print("New best response found!")
        
        # If no relevant contexts were found or all responses scored poorly
        if not relevant_contexts or best_score <= 0:
            print("\nNo suitable context found or poor responses, generating without retrieval...")
            best_response = generate_response(query)
    else:
        # No retrieval needed, generate directly
        print("\nNo retrieval needed, generating response directly...")
        best_response = generate_response(query)
    
    # Final metrics
    metrics["best_score"] = best_score
    metrics["used_retrieval"] = retrieval_needed and best_score > 0
    
    print("\n=== Self-RAG Completed ===")
    
    return {
        "query": query,
        "response": best_response,
        "metrics": metrics
    }
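
The scoring rule in `self_rag` is easier to reason about with the numbers written out. The sketch below copies the weights from the function and prints the range of overall scores for each support level. The names `SUPPORT_SCORES` and `overall_score` are introduced here for illustration only.

```python
# Weights copied from the scoring step in self_rag
SUPPORT_SCORES = {"fully supported": 3, "partially supported": 1, "no support": 0}


def overall_score(support_rating: str, utility_rating: int) -> int:
    """overall = support_score * 5 + utility, as in self_rag."""
    return SUPPORT_SCORES.get(support_rating, 0) * 5 + utility_rating


for label in SUPPORT_SCORES:
    low, high = overall_score(label, 1), overall_score(label, 5)
    print(f"{label}: {low} to {high}")
# fully supported: 16 to 20
# partially supported: 6 to 10
# no support: 1 to 5
```

The ranges do not overlap, so the support level always dominates and utility only breaks ties within a level. Because `rate_utility` returns at least 1, every processed context scores above 0. As written, the `best_score <= 0` fallback therefore appears to fire only when no relevant context was found.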

## Running the Complete Self-RAG System

In [18]:
def run_self_rag_example():
    """
    Demonstrates the complete Self-RAG system with examples.
    """
    # Process document
    pdf_path = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf"  # Path to the PDF document
    print(f"Processing document: {pdf_path}")
    vector_store = process_document(pdf_path)  # Process the document and create a vector store
    
    # Example 1: Query likely needing retrieval
    query1 = "What is the definition of homelessness?"
    print("\n" + "="*80)
    print(f"EXAMPLE 1: {query1}")
    result1 = self_rag(query1, vector_store)  # Run Self-RAG for the first query
    print("\nFinal response:")
    print(result1["response"])  # Print the final response for the first query
    print("\nMetrics:")
    print(json.dumps(result1["metrics"], indent=2))  # Print the metrics for the first query
    
    # Example 2: Another document-specific query that should trigger retrieval
    query2 = "What are the domains of homelessness?"
    print("\n" + "="*80)
    print(f"EXAMPLE 2: {query2}")
    result2 = self_rag(query2, vector_store)  # Run Self-RAG for the second query
    print("\nFinal response:")
    print(result2["response"])  # Print the final response for the second query
    print("\nMetrics:")
    print(json.dumps(result2["metrics"], indent=2))  # Print the metrics for the second query
    
    # Example 3: Query with some relevance to document but requiring additional knowledge
    query3 = "How is homelessness measured?"
    print("\n" + "="*80)
    print(f"EXAMPLE 3: {query3}")
    result3 = self_rag(query3, vector_store)  # Run Self-RAG for the third query
    print("\nFinal response:")
    print(result3["response"])  # Print the final response for the third query
    print("\nMetrics:")
    print(json.dumps(result3["metrics"], indent=2))  # Print the metrics for the third query
    
    return {
        "example1": result1,
        "example2": result2,
        "example3": result3
    }

## Evaluating Self-RAG Against Traditional RAG

In [19]:
def traditional_rag(query, vector_store, top_k=3):
    """
    Implements a traditional RAG approach for comparison.
    
    Args:
        query (str): User query
        vector_store (SimpleVectorStore): Vector store containing document chunks
        top_k (int): Number of documents to retrieve
        
    Returns:
        str: Generated response
    """
    print(f"\n=== Running traditional RAG for query: {query} ===\n")
    
    # Retrieve documents
    print("Retrieving documents...")
    query_embedding = create_embeddings(query)  # Create embeddings for the query
    results = vector_store.similarity_search(query_embedding, k=top_k)  # Search for similar documents
    print(f"Retrieved {len(results)} documents")
    
    # Combine contexts from retrieved documents
    contexts = [result["text"] for result in results]  # Extract text from results
    combined_context = "\n\n".join(contexts)  # Combine texts into a single context
    
    # Generate response using the combined context
    print("Generating response...")
    response = generate_response(query, combined_context)  # Generate response based on the combined context
    
    return response

In [20]:
def evaluate_rag_approaches(pdf_path, test_queries, reference_answers=None):
    """
    Compare Self-RAG with traditional RAG.
    
    Args:
        pdf_path (str): Path to the document
        test_queries (List[str]): List of test queries
        reference_answers (List[str], optional): Reference answers for evaluation
        
    Returns:
        dict: Evaluation results
    """
    print("=== Evaluating RAG Approaches ===")
    
    # Process document to create a vector store
    vector_store = process_document(pdf_path)
    
    results = []
    
    for i, query in enumerate(test_queries):
        print(f"\nProcessing query {i+1}: {query}")
        
        # Run Self-RAG
        self_rag_result = self_rag(query, vector_store)  # Get response from Self-RAG
        self_rag_response = self_rag_result["response"]
        
        # Run traditional RAG
        trad_rag_response = traditional_rag(query, vector_store)  # Get response from traditional RAG
        
        # Compare results if reference answer is available
        reference = reference_answers[i] if reference_answers and i < len(reference_answers) else None
        comparison = compare_responses(query, self_rag_response, trad_rag_response, reference)  # Compare responses
        
        results.append({
            "query": query,
            "self_rag_response": self_rag_response,
            "traditional_rag_response": trad_rag_response,
            "reference_answer": reference,
            "comparison": comparison,
            "self_rag_metrics": self_rag_result["metrics"]
        })
    
    # Generate overall analysis
    overall_analysis = generate_overall_analysis(results)
    
    return {
        "results": results,
        "overall_analysis": overall_analysis
    }
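
When reference answers are available, a cheap lexical score can serve as a sanity check alongside the LLM-written comparison. The following is a minimal sketch, not part of the original pipeline: `token_f1` is a hypothetical helper, and token-overlap F1 is a swapped-in, deliberately crude measure that rewards shared wording rather than factual correctness.

```python
import re
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated response and a reference answer.

    Hypothetical helper for illustration; it only measures shared words.
    """
    # Lowercase and split into word tokens
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Count tokens that appear in both texts (multiset intersection)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Inside the evaluation loop, it could be called as `token_f1(self_rag_response, reference)` and `token_f1(trad_rag_response, reference)` whenever `reference` is not `None`, and the two scores stored next to the `comparison` text.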

In [21]:
import os
import google.generativeai as genai
from typing import Optional

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. The main comparison function (revised for Gemini) ---
def compare_responses(query: str, self_rag_response: str, trad_rag_response: str, reference: Optional[str] = None) -> str:
    """
    Compare responses from Self-RAG and traditional RAG using Gemini.

    Args:
        query (str): User query
        self_rag_response (str): Response from Self-RAG
        trad_rag_response (str): Response from traditional RAG
        reference (str, optional): Reference answer

    Returns:
        str: Comparison analysis
    """
    system_prompt = """You are an expert evaluator of RAG systems. Your task is to compare responses from two different RAG approaches:
1. Self-RAG: A dynamic approach that decides if retrieval is needed and evaluates information relevance and response quality
2. Traditional RAG: Always retrieves documents and uses them to generate a response

Compare the responses based on:
- Relevance to the query
- Factual correctness
- Completeness and informativeness
- Conciseness and focus"""

    user_prompt = f"""Query: {query}

Response from Self-RAG:
{self_rag_response}

Response from Traditional RAG:
{trad_rag_response}
"""

    if reference:
        user_prompt += f"""
Reference Answer (for factual checking):
{reference}
"""

    user_prompt += """
Compare these responses and explain which one is better and why.
Focus on accuracy, relevance, completeness, and quality.
"""

    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=system_prompt)
        
        # Generate the comparison; a temperature of 0.0 keeps the evaluation as repeatable as possible
        response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0.0})
        
        # Return the generated response content
        return response.text
    except Exception as e:
        print(f"An error occurred during comparison: {e}")
        return "Comparison failed due to an error."
        
# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate responses and a query
    query = "What are the main causes of homelessness?"
    self_rag_response = "Homelessness is caused by a lack of affordable housing and social factors like family breakdown."
    trad_rag_response = "Homelessness is a complex problem with many contributing factors. The sun is a star at the center of the solar system."
    reference = "The main causes of homelessness are a lack of affordable housing and social and economic factors."

    print("Comparing responses with Gemini...")
    analysis = compare_responses(query, self_rag_response, trad_rag_response, reference)
    
    print("\nComparison Analysis:")
    print(analysis)

Comparing responses with Gemini...

Comparison Analysis:
## Comparison of RAG Responses on Homelessness

Here's a comparison of the two RAG responses based on the provided criteria:

**Self-RAG Response:**

* **Relevance:** Highly relevant. Directly addresses the query's core question.
* **Factual Correctness:** Mostly correct.  Identifying a lack of affordable housing as a major cause is accurate.  While "family breakdown" is a contributing factor, it's an oversimplification and lacks the breadth of other relevant social and economic factors.
* **Completeness and Informativeness:** Incomplete. It only mentions two broad causes, leaving out significant factors like unemployment, mental illness, addiction, and domestic violence.
* **Conciseness and Focus:** Concise and focused, but at the cost of comprehensiveness.


**Traditional RAG Response:**

* **Relevance:** Partially relevant. The first sentence attempts to address the query but is too general.  The second sentence ("The sun is a

In [22]:
import os
import google.generativeai as genai
from typing import List, Dict, Any

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. The main analysis function (revised for Gemini) ---
def generate_overall_analysis(results: List[Dict], model: str = "gemini-1.5-flash") -> str:
    """
    Generate an overall analysis of Self-RAG vs traditional RAG using Gemini.
    
    Args:
        results (List[Dict]): Results from evaluate_rag_approaches
        model (str): The model to be used for the analysis.
        
    Returns:
        str: Overall analysis
    """
    system_prompt = """You are an expert evaluator of RAG systems. Your task is to provide an overall analysis comparing
Self-RAG and Traditional RAG based on multiple test queries.

Focus your analysis on:
1. When Self-RAG performs better and why
2. When Traditional RAG performs better and why
3. The impact of dynamic retrieval decisions in Self-RAG
4. The value of relevance and support evaluation in Self-RAG
5. Overall recommendations on which approach to use for different types of queries"""
    
    # Prepare a summary of the individual comparisons
    comparisons_summary = ""
    for i, result in enumerate(results):
        comparisons_summary += f"Query {i+1}: {result['query']}\n"
        comparisons_summary += f"Self-RAG metrics: Retrieval needed: {result['self_rag_metrics']['retrieval_needed']}, "
        comparisons_summary += f"Relevant docs: {result['self_rag_metrics']['relevant_documents']}/{result['self_rag_metrics']['documents_retrieved']}\n"
        comparisons_summary += f"Comparison summary: {result['comparison'][:200]}...\n\n"
        
    user_prompt = f"""Based on the following comparison results from {len(results)} test queries, please provide an overall analysis of
Self-RAG versus Traditional RAG:

{comparisons_summary}

Please provide your comprehensive analysis.
"""
    
    try:
        # Create a Gemini model instance with the system prompt
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate the overall analysis
        response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0.0})
        
        return response.text
    except Exception as e:
        print(f"An error occurred during analysis generation: {e}")
        return "Analysis failed due to an error."

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a results list from a previous evaluation pipeline
    simulated_results = [
        {
            "query": "What are the key differences between a black hole and a wormhole?",
            "self_rag_metrics": {"retrieval_needed": "Yes", "relevant_documents": 3, "documents_retrieved": 3},
            "comparison": "Self-RAG provided a highly accurate, concise comparison, leveraging its internal decision to retrieve documents on black holes and wormholes. Traditional RAG also retrieved relevant documents, but the Self-RAG's targeted approach led to a more focused response.",
        },
        {
            "query": "What is the primary cause of tides?",
            "self_rag_metrics": {"retrieval_needed": "No", "relevant_documents": 0, "documents_retrieved": 0},
            "comparison": "Self-RAG correctly identified this as a simple, common knowledge query and did not perform retrieval. Its response was accurate and direct. Traditional RAG's retrieval of irrelevant documents on astrophysics may have introduced unnecessary noise, though its final answer was still correct.",
        }
    ]

    print("Generating overall analysis with Gemini...")
    overall_analysis = generate_overall_analysis(simulated_results)
    
    print("\n=== OVERALL ANALYSIS ===")
    print(overall_analysis)

Generating overall analysis with Gemini...

=== OVERALL ANALYSIS ===
## Self-RAG vs. Traditional RAG: A Comparative Analysis

Based on the provided results for two test queries, we can begin to draw some conclusions about the relative strengths and weaknesses of Self-RAG and Traditional RAG systems.  However, two queries are insufficient for a definitive judgment; a more extensive evaluation with diverse query types is needed for robust conclusions.  The analysis below is therefore preliminary but highlights key observations.

**1. When Self-RAG Performs Better:**

* **Queries requiring no external knowledge:** Self-RAG excels when the query can be answered using the model's inherent knowledge base.  Query 2 ("What is the primary cause of tides?") demonstrates this.  By intelligently deciding *not* to retrieve external documents, Self-RAG avoids the potential pitfalls of irrelevant or conflicting information from a traditional RAG system. This dynamic retrieval decision is a key advant

## Evaluating the Self-RAG System

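The final step is to evaluate the Self-RAG system against a traditional RAG baseline. We compare the quality of the responses generated by both systems and analyze how Self-RAG performs in different scenarios. The evaluation cell in this section uses the OpenAI client with stub versions of `load_vector_store`, `self_rag`, and `standard_rag`; running it redefines the functions of the same name from earlier cells, so swap in the real implementations before drawing conclusions from its output.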
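
To analyze Self-RAG across scenarios with numbers as well as prose, the per-query metrics can be aggregated into a few rates. The following is a minimal sketch, not part of the original pipeline: `summarize_self_rag_metrics` is a hypothetical helper, and it assumes each result carries a `self_rag_metrics` dict with the `retrieval_needed`, `documents_retrieved`, and `relevant_documents` keys that `generate_overall_analysis` reads.

```python
from typing import Any, Dict, List


def summarize_self_rag_metrics(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate per-query Self-RAG metrics into simple rates (hypothetical helper)."""
    total = len(results)
    if total == 0:
        return {"queries": 0, "retrieval_rate": 0.0, "relevant_doc_ratio": 0.0}

    retrieval_count = 0
    retrieved = 0
    relevant = 0
    for result in results:
        metrics = result["self_rag_metrics"]
        needed = metrics.get("retrieval_needed", False)
        # Assumption: the flag may be a bool or a "Yes"/"No" string
        if isinstance(needed, str):
            needed = needed.strip().lower() in {"yes", "true"}
        retrieval_count += int(bool(needed))
        retrieved += metrics.get("documents_retrieved", 0)
        relevant += metrics.get("relevant_documents", 0)

    return {
        "queries": total,
        # Share of queries for which Self-RAG decided to retrieve
        "retrieval_rate": retrieval_count / total,
        # Share of retrieved documents that were judged relevant
        "relevant_doc_ratio": relevant / retrieved if retrieved else 0.0,
    }


# Illustrative input only, shaped like the simulated results used earlier
example_results = [
    {"self_rag_metrics": {"retrieval_needed": "Yes", "relevant_documents": 3, "documents_retrieved": 3}},
    {"self_rag_metrics": {"retrieval_needed": "No", "relevant_documents": 0, "documents_retrieved": 0}},
]
print(summarize_self_rag_metrics(example_results))
```

A low `relevant_doc_ratio` combined with a high `retrieval_rate` would suggest that retrieval is being triggered often but returning little useful context, which is the situation where Self-RAG's relevance filtering should matter most.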

In [32]:
import os
from openai import OpenAI
import fitz  # For PDF loading in the vector store stub

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# --- STUB IMPLEMENTATIONS ---
def load_vector_store(pdf_path: str):
    """
    Stub: pretend we load and index the PDF into a vector store.
    Replace with your actual vector store initialization.
    """
    # Read only the first page (just to use fitz) and close the document afterwards
    with fitz.open(pdf_path) as doc:
        text = doc[0].get_text("text")
    return {"_stub_store": text}

def self_rag(query: str, store):
    """
    Stub: pretend Self-RAG returns a dict with a generated response.
    Replace with your actual adaptive retrieval + generation.
    """
    return {"response": f"Self-RAG answer for '{query}'"}

def standard_rag(query: str, store):
    """
    Stub: pretend traditional RAG returns a dict with a generated response.
    Replace with your actual standard retrieval + generation.
    """
    return {"response": f"Standard RAG answer for '{query}'"}

# --- HELPER FUNCTIONS USING VALID CHAT MODEL ---
def determine_if_retrieval_needed(query: str) -> bool:
    system_prompt = "Decide if retrieval is needed. Answer 'yes' or 'no' only."
    user_prompt   = f"Query: {query}\nIs retrieval necessary to answer this accurately?"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role":"system","content":system_prompt},
            {"role":"user",  "content":user_prompt}
        ]
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def compare_responses(results: list[dict]) -> str:
    system_prompt = "Compare two RAG responses against a reference. Provide an overall summary."
    text = ""
    for r in results:
        text += (
            f"Query: {r['query']}\n"
            f" Self-RAG: {r['self_rag_response']}\n"
            f" Standard-RAG: {r['standard_rag_response']}\n"
            f" Reference: {r['reference_answer']}\n\n"
        )
    user_prompt = text + "Which method is better overall, and why?"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role":"system","content":system_prompt},
            {"role":"user",  "content":user_prompt}
        ]
    )
    return resp.choices[0].message.content

# --- MAIN DRIVER ---
def evaluate_rag_approaches(pdf_path: str,
                            test_queries: list[str],
                            reference_answers: list[str]) -> dict:
    store = load_vector_store(pdf_path)
    comparison_results = []

    for query, ref in zip(test_queries, reference_answers):
        # Traditional RAG always retrieves; only Self-RAG may skip retrieval
        standard_resp = standard_rag(query, store)["response"]
        if determine_if_retrieval_needed(query):
            self_resp = self_rag(query, store)["response"]
        else:
            self_resp = "Retrieval skipped."

        comparison_results.append({
            "query":                  query,
            "self_rag_response":      self_resp,
            "standard_rag_response":  standard_resp,
            "reference_answer":       ref
        })

    overall_analysis = compare_responses(comparison_results)
    return {
        "comparison":       comparison_results,
        "overall_analysis": overall_analysis
    }

# --- USAGE EXAMPLE ---
pdf_path = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf"
test_queries = [
    "What measurement approaches are commonly used to collect data on homelessness in Europe?",
]
reference_answers = [
    "Common approaches include point-in-time surveys (national and local counts), "
    "service-provider registers (NGO and municipal client data), and infrequent "
    "census-based surveys, each covering different ETHOS categories and providing "
    "either stock or prevalence data."
]

evaluation_results = evaluate_rag_approaches(
    pdf_path=pdf_path,
    test_queries=test_queries,
    reference_answers=reference_answers
)

print("\n=== OVERALL ANALYSIS ===\n")
print(evaluation_results["overall_analysis"])



=== OVERALL ANALYSIS ===

To evaluate the Self-RAG and Standard-RAG responses against the reference, we need to analyze their content, accuracy, and comprehensiveness.

**Self-RAG Response:**
- The Self-RAG response likely provides a personal interpretation or summary of the measurement approaches used to collect data on homelessness in Europe. However, without the actual content, we cannot assess its accuracy or detail.

**Standard-RAG Response:**
- The Standard-RAG response is expected to be more structured and aligned with established data sources and methodologies. It mentions specific approaches such as point-in-time surveys, service-provider registers, and census-based surveys, which are all relevant to the topic. It also references the ETHOS categories and distinguishes between stock and prevalence data, indicating a deeper understanding of the complexities involved in measuring homelessness.

**Reference:**
- The reference provides a clear and concise overview of the common me