# Proposition Chunking for Enhanced RAG

In this notebook, I implement proposition chunking, a technique that breaks documents into atomic, factual statements for more accurate retrieval. Unlike traditional chunking, which simply divides text by character count, proposition chunking preserves the semantic integrity of individual facts.

Proposition chunking delivers more precise retrieval by:

1. Breaking content into atomic, self-contained facts
2. Creating smaller, more granular units for retrieval  
3. Enabling more precise matching between queries and relevant content
4. Filtering out low-quality or incomplete propositions

Let's build a complete implementation without relying on LangChain or FAISS.

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
import fitz # PyMuPDF
import os
import numpy as np
import json
import re
import google.generativeai as genai


## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [3]:
import fitz
from typing import List, Dict

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file using PyMuPDF (fitz).

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the PDF, or an empty string if an error occurs.
    """
    all_text = ""
    try:
        # Use a context manager to automatically close the document
        with fitz.open(pdf_path) as mypdf:
            # Iterate through each page to extract text
            for page in mypdf:
                all_text += page.get_text("text") + " "
    except Exception as e:
        print(f"Error reading PDF file: {e}")
        return ""
    
    return all_text

In [4]:
def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the entire PDF.
    """
    # Open the PDF file
    doc = fitz.open(pdf_path)
    all_text = []

    # Iterate through each page in the PDF
    for page in doc:
        all_text.append(page.get_text("text"))

    doc.close()
    return "\n".join(all_text)

# Example usage:
pdf_file = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf"
text = extract_text_from_pdf(pdf_file)
print(text) 

19
Defining and Measuring Homelessness
Volker Busch-Geertsema
GISS, Germany
>> Abstract_ Substantial progress has been made at EU level on defining home-
lessness. The European Typology on Homelessness and Housing Exclusion 
(ETHOS) is widely accepted in almost all European countries (and beyond) as 
a useful conceptual framework and almost everywhere definitions at national 
level (though often not identical with ETHOS) are discussed in relation to this 
typology. The development and some of the remaining controversial issues 
concerning ETHOS and a reduced version of it are discussed in this chapter. 
Furthermore essential reasons and different approaches to measure home-
lessness are presented. It is argued that a single number will not be enough 
to understand homelessness and monitor progress in tackling it. More 
research and more work to improve information on homelessness at national 
levels will be needed before we can achieve comparable numbers at EU level.
>> Keywords_ Data,

## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [5]:
def chunk_text(text, chunk_size=800, overlap=100):
    """
    Split text into overlapping chunks.
    
    Args:
        text (str): Input text to chunk
        chunk_size (int): Size of each chunk in characters
        overlap (int): Overlap between chunks in characters
        
    Returns:
        List[Dict]: List of chunk dictionaries with text and metadata
    """
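    # Guard against a zero or negative step in the loop below
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    
    chunks = []  # Initialize an empty list to store the chunks
    
    # Iterate over the text with the specified chunk size and overlap
    for i in range(0, len(text), chunk_size - overlap):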
        chunk = text[i:i + chunk_size]  # Extract a chunk of the specified size
        if chunk:  # Ensure we don't add empty chunks
            chunks.append({
                "text": chunk,  # The chunk text
                "chunk_id": len(chunks) + 1,  # Unique ID for the chunk
                "start_char": i,  # Starting character index of the chunk
                "end_char": i + len(chunk)  # Ending character index of the chunk
            })
    
    print(f"Created {len(chunks)} text chunks")  # Print the number of created chunks
    return chunks  # Return the list of chunks
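
To make the stride arithmetic concrete, here is a small sketch that lists the character offsets the loop above would visit. The helper name `chunk_offsets` is hypothetical and not part of the notebook; it only mirrors `range(0, len(text), chunk_size - overlap)`, and its guard against `overlap >= chunk_size` is my assumption about the desired behaviour.

```python
from typing import List, Tuple

def chunk_offsets(text_length: int, chunk_size: int = 800, overlap: int = 100) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for overlapping chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap  # each chunk starts this many characters after the previous one
    return [
        (start, min(start + chunk_size, text_length))
        for start in range(0, text_length, stride)
    ]

# A 2000-character text with the notebook's defaults (800 / 100)
print(chunk_offsets(2000))
```

With the defaults the stride is 700 characters, so consecutive chunks share 100 characters. The final chunk can be short: for a 1500-character text it would cover only offsets 1400 to 1500, which the previous chunk already contains.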

## Simple Vector Store Implementation
We'll create a basic vector store to manage document chunks and their embeddings.

In [6]:
class SimpleVectorStore:
    """
    A simple vector store implementation using NumPy.
    """
    def __init__(self):
        # Initialize lists to store vectors, texts, and metadata
        self.vectors = []
        self.texts = []
        self.metadata = []
    
    def add_item(self, text, embedding, metadata=None):
        """
        Add an item to the vector store.
        
        Args:
            text (str): The text content
            embedding (List[float]): The embedding vector
            metadata (Dict, optional): Additional metadata
        """
        # Append the embedding, text, and metadata to their respective lists
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})
    
    def add_items(self, texts, embeddings, metadata_list=None):
        """
        Add multiple items to the vector store.
        
        Args:
            texts (List[str]): List of text contents
            embeddings (List[List[float]]): List of embedding vectors
            metadata_list (List[Dict], optional): List of metadata dictionaries
        """
        # If no metadata list is provided, create an empty dictionary for each text
        if metadata_list is None:
            metadata_list = [{} for _ in range(len(texts))]
        
        # Add each text, embedding, and metadata to the store
        for text, embedding, metadata in zip(texts, embeddings, metadata_list):
            self.add_item(text, embedding, metadata)
    
    def similarity_search(self, query_embedding, k=5):
        """
        Find the most similar items to a query embedding.
        
        Args:
            query_embedding (List[float]): Query embedding vector
            k (int): Number of results to return
            
        Returns:
            List[Dict]: Top k most similar items
        """
        # Return an empty list if there are no vectors in the store
        if not self.vectors:
            return []
        
        # Convert query embedding to a numpy array
        query_vector = np.array(query_embedding)
        
        # Calculate similarities using cosine similarity
        similarities = []
        for i, vector in enumerate(self.vectors):
            similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
            similarities.append((i, similarity))
        
        # Sort by similarity in descending order
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Collect the top k results
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],
                "metadata": self.metadata[idx],
                "similarity": float(score)  # Convert to float for JSON serialization
            })
        
        return results
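
The per-vector loop in `similarity_search` is easy to follow, but the same cosine ranking can be computed in one NumPy expression. The sketch below is an alternative, not the notebook's implementation: `cosine_top_k` is a hypothetical standalone function, and the small epsilon in the denominator is an added assumption to avoid dividing by zero for all-zero vectors.

```python
import numpy as np

def cosine_top_k(query, vectors, k=5):
    """Return (index, score) pairs for the k rows of `vectors` most similar to `query`."""
    if len(vectors) == 0:
        return []
    matrix = np.asarray(vectors, dtype=float)  # shape (n_items, dim)
    q = np.asarray(query, dtype=float)         # shape (dim,)
    # Product of norms; the epsilon avoids division by zero for all-zero vectors
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(q) + 1e-12
    scores = (matrix @ q) / denom
    top = np.argsort(-scores)[:k]              # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in top]

# Toy 2-D vectors, made up for illustration
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top_hits = cosine_top_k([1.0, 0.0], toy_vectors, k=2)
print(top_hits)
```

The query points along the first axis, so the first toy vector ranks highest, followed by the diagonal one.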

## Creating Embeddings

In [8]:
import os
import google.generativeai as genai
from typing import List, Any, Union
import numpy as np

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the create_embeddings function for Gemini ---
def create_embeddings(text: Union[str, List[str]], model: str = "models/embedding-001") -> Any:
    """
    Creates embeddings for the given text or list of texts using the Gemini API.

    Args:
    text (str or List[str]): The input text(s) for which embeddings are to be created.
    model (str): The model to be used for creating embeddings. Default is "models/embedding-001".

    Returns:
    List[float] or List[List[float]]: The embedding vector(s).
    """
    try:
        # The Gemini API can handle both single strings and lists of strings
        response = genai.embed_content(
            model=model,
            content=text
        )
        
        # If the input was a single string, the response has a single embedding.
        # If the input was a list, the response is a list of embeddings.
        return response['embedding']

    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return []

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Example 1: Create an embedding for a single string
    single_text = "Homelessness is a complex social issue."
    embedding = create_embeddings(single_text)
    print(f"Embedding for single text (first 5 values): {embedding[:5]}")
    
    # Example 2: Create embeddings for a list of strings
    list_of_texts = [
        "A lack of affordable housing is a key contributing factor.",
        "Social factors also play a role in homelessness."
    ]
    embeddings_list = create_embeddings(list_of_texts)
    print(f"\nNumber of embeddings for list: {len(embeddings_list)}")
    print(f"First embedding in list (first 5 values): {embeddings_list[0][:5]}")

Embedding for single text (first 5 values): [0.052571062, -0.03685706, -0.06520665, -0.04034025, 0.038206574]

Number of embeddings for list: 2
First embedding in list (first 5 values): [0.07521696, -0.034325134, -0.039195377, -0.008227663, 0.10222888]
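
`create_embeddings` sends whatever list it receives in a single request. For a long document it may be safer to embed in batches. The sketch below assumes a per-request size limit (the value 100 is my assumption, not a documented limit) and takes the embedding function as a parameter so it can be tried without network access. `embed_in_batches` and `fake_embed` are hypothetical names.

```python
from typing import Callable, List, Sequence

def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,  # assumed limit; check the API documentation for the real one
) -> List[List[float]]:
    """Embed `texts` in fixed-size batches and return one vector per text, in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_fn(batch)
        # Fail loudly if the embedding function returned fewer vectors than texts
        if len(vectors) != len(batch):
            raise RuntimeError(f"Expected {len(batch)} embeddings, got {len(vectors)}")
        embeddings.extend(vectors)
    return embeddings

def fake_embed(batch: List[str]) -> List[List[float]]:
    """Stand-in for a real embedding call: encodes only the text length."""
    return [[float(len(text)), 0.0] for text in batch]

batched = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(batched)
```

In the notebook, `create_embeddings` could be passed as `embed_fn`. Because it returns an empty list on error, the length check would surface a failed batch instead of silently storing nothing.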


## Proposition Generation

In [9]:
import os
import re
import google.generativeai as genai
from typing import Dict, List, Any

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. The main proposition generation function (revised for Gemini) ---
def generate_propositions(chunk: Dict, model: str = "gemini-1.5-flash") -> List[str]:
    """
    Generate atomic, self-contained propositions from a text chunk using Gemini.
    
    Args:
        chunk (Dict): Text chunk with content and metadata
        model (str): The model to be used for proposition generation.
        
    Returns:
        List[str]: List of generated propositions
    """
    system_prompt = """Please break down the following text into simple, self-contained propositions.
Ensure that each proposition meets the following criteria:

1. Express a Single Fact: Each proposition should state one specific fact or claim.
2. Be Understandable Without Context: The proposition should be self-contained, meaning it can be understood without needing additional context.
3. Use Full Names, Not Pronouns: Avoid pronouns or ambiguous references; use full entity names.
4. Include Relevant Dates/Qualifiers: If applicable, include necessary dates, times, and qualifiers to make the fact precise.
5. Contain One Subject-Predicate Relationship: Focus on a single subject and its corresponding action or attribute, without conjunctions or multiple clauses.

Output ONLY the list of propositions without any additional text or explanations."""

    user_prompt = f"Text to convert into propositions:\n\n{chunk['text']}"
    
    try:
        # Create a Gemini model instance with the system prompt
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate the propositions
        response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0})
        
        # Extract propositions from the response
        raw_propositions = response.text.strip().split('\n')
        
        # Clean up propositions (remove numbering, bullets, etc.)
        clean_propositions = []
        for prop in raw_propositions:
            cleaned = re.sub(r'^\s*(\d+\.|\-|\*)\s*', '', prop).strip()
            if cleaned and len(cleaned) > 10:
                clean_propositions.append(cleaned)
        
        return clean_propositions
    except Exception as e:
        print(f"An error occurred during proposition generation: {e}")
        return []

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a text chunk
    sample_chunk = {
        "text": "Homelessness is a complex social problem. A key factor is the lack of affordable housing, which disproportionately affects low-income families and individuals.",
        "metadata": {"source": "document.pdf", "chunk_index": 0}
    }
    
    print("Generating propositions with Gemini...")
    propositions = generate_propositions(sample_chunk)
    
    print("\nGenerated Propositions:")
    for prop in propositions:
        print(f"- {prop}")

Generating propositions with Gemini...

Generated Propositions:
- Homelessness is a complex social problem.
- A lack of affordable housing is a key factor in homelessness.
- Affordable housing is disproportionately lacking for low-income families.
- Affordable housing is disproportionately lacking for low-income individuals.
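
The clean-up step inside `generate_propositions` can be tried offline on a hand-written response. The sketch below reuses the same regular expression and length filter; the function name `clean_proposition_lines` is hypothetical, and the sample string is made up for illustration rather than real model output.

```python
import re
from typing import List

def clean_proposition_lines(raw_text: str, min_length: int = 10) -> List[str]:
    """Strip list markers from each line and drop empty or very short lines."""
    cleaned = []
    for line in raw_text.strip().split("\n"):
        # Remove a leading "1.", "-" or "*" marker, as in generate_propositions
        text = re.sub(r'^\s*(\d+\.|\-|\*)\s*', '', line).strip()
        if text and len(text) > min_length:
            cleaned.append(text)
    return cleaned

sample_response = (
    "1. Homelessness is a complex social problem.\n"
    "- A lack of affordable housing is a key factor in homelessness.\n"
    "* Too short\n"
    "\n"
)
cleaned_lines = clean_proposition_lines(sample_response)
for line in cleaned_lines:
    print(line)
```

Only the first two lines survive: the third is shorter than the length cutoff and the blank line is skipped. The pattern covers `1.`, `-` and `*` markers only, so a response numbered as `1)` would keep its prefix.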


## Quality Checking for Propositions

In [10]:
import os
import json
import re
import google.generativeai as genai
from typing import Dict

# --- 1. Gemini API Configuration ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. The main proposition evaluation function (revised for Gemini) ---
def evaluate_proposition(proposition: str, original_text: str, model: str = "gemini-1.5-flash") -> Dict:
    """
    Evaluate a proposition's quality based on accuracy, clarity, completeness, and conciseness.
    
    Args:
        proposition (str): The proposition to evaluate
        original_text (str): The original text for comparison
        model (str): The model to be used for evaluation.
        
    Returns:
        Dict: Scores for each evaluation dimension
    """
    system_prompt = """You are an expert at evaluating the quality of propositions extracted from text.
Rate the given proposition on the following criteria (scale 1-10):

- Accuracy: How well the proposition reflects information in the original text
- Clarity: How easy it is to understand the proposition without additional context
- Completeness: Whether the proposition includes necessary details (dates, qualifiers, etc.)
- Conciseness: Whether the proposition is concise without losing important information

The response must be in valid JSON format with numerical scores for each criterion:
{"accuracy": X, "clarity": X, "completeness": X, "conciseness": X}
"""

    user_prompt = f"""Proposition: {proposition}

Original Text: {original_text}

Please provide your evaluation scores in JSON format."""

    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate the response, specifying JSON output
        response = gemini_model.generate_content(
            user_prompt,
            # response_mime_type belongs inside generation_config;
            # generate_content() does not accept it as a keyword argument
            generation_config={
                "temperature": 0,
                "response_mime_type": "application/json"
            }
        )
        
        # Parse the JSON response
        scores = json.loads(response.text.strip())
        return scores
    except Exception as e:  # Also covers json.JSONDecodeError
        print(f"An error occurred during evaluation: {e}")
        # Fallback to a default score if an error occurs
        return {
            "accuracy": 5,
            "clarity": 5,
            "completeness": 5,
            "conciseness": 5
        }

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    sample_proposition = "Homelessness is a complex social problem with a variety of economic, social, and personal contributing factors."
    sample_original_text = "Homelessness, as a societal issue, is a state of not having a stable and safe place to live. It is a complex social problem with a variety of contributing factors, including economic, social, and personal issues."
    
    print("Evaluating proposition with Gemini...")
    evaluation_scores = evaluate_proposition(sample_proposition, sample_original_text)
    
    print("\nEvaluation Scores:")
    print(json.dumps(evaluation_scores, indent=2))


## Complete Proposition Processing Pipeline

In [11]:
def process_document_into_propositions(pdf_path, chunk_size=800, chunk_overlap=100, 
                                      quality_thresholds=None):
    """
    Process a document into quality-checked propositions.
    
    Args:
        pdf_path (str): Path to the PDF file
        chunk_size (int): Size of each chunk in characters
        chunk_overlap (int): Overlap between chunks in characters
        quality_thresholds (Dict): Threshold scores for proposition quality
        
    Returns:
        Tuple[List[Dict], List[Dict]]: Original chunks and proposition chunks
    """
    # Set default quality thresholds if not provided
    if quality_thresholds is None:
        quality_thresholds = {
            "accuracy": 7,
            "clarity": 7,
            "completeness": 7,
            "conciseness": 7
        }
    
    # Extract text from the PDF file
    text = extract_text_from_pdf(pdf_path)
    
    # Create chunks from the extracted text
    chunks = chunk_text(text, chunk_size, chunk_overlap)
    
    # Initialize a list to store all propositions
    all_propositions = []
    
    print("Generating propositions from chunks...")
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}...")
        
        # Generate propositions for the current chunk
        chunk_propositions = generate_propositions(chunk)
        print(f"Generated {len(chunk_propositions)} propositions")
        
        # Process each generated proposition
        for prop in chunk_propositions:
            proposition_data = {
                "text": prop,
                "source_chunk_id": chunk["chunk_id"],
                "source_text": chunk["text"]
            }
            all_propositions.append(proposition_data)
    
    # Evaluate the quality of the generated propositions
    print("\nEvaluating proposition quality...")
    quality_propositions = []
    
    for i, prop in enumerate(all_propositions):
        if i % 10 == 0:  # Status update every 10 propositions
            print(f"Evaluating proposition {i+1}/{len(all_propositions)}...")
            
        # Evaluate the quality of the current proposition
        scores = evaluate_proposition(prop["text"], prop["source_text"])
        prop["quality_scores"] = scores
        
        # Check if the proposition passes the quality thresholds
        passes_quality = True
        for metric, threshold in quality_thresholds.items():
            if scores.get(metric, 0) < threshold:
                passes_quality = False
                break
        
        if passes_quality:
            quality_propositions.append(prop)
        else:
            print(f"Proposition failed quality check: {prop['text'][:50]}...")
    
    print(f"\nRetained {len(quality_propositions)}/{len(all_propositions)} propositions after quality filtering")
    
    return chunks, quality_propositions
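
The threshold check in the pipeline can be isolated into a small predicate, which makes it easy to test without calling the model. This is a sketch: `passes_quality` and `DEFAULT_THRESHOLDS` are hypothetical names, and the score dictionaries are made-up examples.

```python
from typing import Dict, Optional

DEFAULT_THRESHOLDS = {"accuracy": 7, "clarity": 7, "completeness": 7, "conciseness": 7}

def passes_quality(scores: Dict[str, float], thresholds: Optional[Dict[str, float]] = None) -> bool:
    """Return True only if every metric meets its threshold; missing metrics count as 0."""
    thresholds = thresholds or DEFAULT_THRESHOLDS
    return all(scores.get(metric, 0) >= threshold for metric, threshold in thresholds.items())

good_scores = {"accuracy": 9, "clarity": 8, "completeness": 7, "conciseness": 8}
weak_scores = {"accuracy": 9, "clarity": 6, "completeness": 8, "conciseness": 8}
fallback_scores = {"accuracy": 5, "clarity": 5, "completeness": 5, "conciseness": 5}

print(passes_quality(good_scores))      # every metric is at or above 7
print(passes_quality(weak_scores))      # clarity is below 7
print(passes_quality(fallback_scores))  # the fallback returned by evaluate_proposition on error
```

Note that the fallback scores of 5 fall below the default threshold of 7, so a failed evaluation call causes the proposition to be dropped rather than retained.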

## Building Vector Stores for Both Approaches

In [12]:
def build_vector_stores(chunks, propositions):
    """
    Build vector stores for both chunk-based and proposition-based approaches.
    
    Args:
        chunks (List[Dict]): Original document chunks
        propositions (List[Dict]): Quality-filtered propositions
        
    Returns:
        Tuple[SimpleVectorStore, SimpleVectorStore]: Chunk and proposition vector stores
    """
    # Create vector store for chunks
    chunk_store = SimpleVectorStore()
    
    # Extract chunk texts and create embeddings
    chunk_texts = [chunk["text"] for chunk in chunks]
    print(f"Creating embeddings for {len(chunk_texts)} chunks...")
    chunk_embeddings = create_embeddings(chunk_texts)
    
    # Add chunks to vector store with metadata
    chunk_metadata = [{"chunk_id": chunk["chunk_id"], "type": "chunk"} for chunk in chunks]
    chunk_store.add_items(chunk_texts, chunk_embeddings, chunk_metadata)
    
    # Create vector store for propositions
    prop_store = SimpleVectorStore()
    
    # Extract proposition texts and create embeddings
    prop_texts = [prop["text"] for prop in propositions]
    print(f"Creating embeddings for {len(prop_texts)} propositions...")
    prop_embeddings = create_embeddings(prop_texts)
    
    # Add propositions to vector store with metadata
    prop_metadata = [
        {
            "type": "proposition", 
            "source_chunk_id": prop["source_chunk_id"],
            "quality_scores": prop["quality_scores"]
        } 
        for prop in propositions
    ]
    prop_store.add_items(prop_texts, prop_embeddings, prop_metadata)
    
    return chunk_store, prop_store

## Query and Retrieval Functions

In [13]:
def retrieve_from_store(query, vector_store, k=5):
    """
    Retrieve relevant items from a vector store based on query.
    
    Args:
        query (str): User query
        vector_store (SimpleVectorStore): Vector store to search
        k (int): Number of results to retrieve
        
    Returns:
        List[Dict]: Retrieved items with scores and metadata
    """
    # Create query embedding
    query_embedding = create_embeddings(query)
    
    # Search vector store for the top k most similar items
    results = vector_store.similarity_search(query_embedding, k=k)
    
    return results

In [14]:
def compare_retrieval_approaches(query, chunk_store, prop_store, k=5):
    """
    Compare chunk-based and proposition-based retrieval for a query.
    
    Args:
        query (str): User query
        chunk_store (SimpleVectorStore): Chunk-based vector store
        prop_store (SimpleVectorStore): Proposition-based vector store
        k (int): Number of results to retrieve from each store
        
    Returns:
        Dict: Comparison results
    """
    print(f"\n=== Query: {query} ===")
    
    # Retrieve results from the proposition-based vector store
    print("\nRetrieving with proposition-based approach...")
    prop_results = retrieve_from_store(query, prop_store, k)
    
    # Retrieve results from the chunk-based vector store
    print("Retrieving with chunk-based approach...")
    chunk_results = retrieve_from_store(query, chunk_store, k)
    
    # Display proposition-based results
    print("\n=== Proposition-Based Results ===")
    for i, result in enumerate(prop_results):
        print(f"{i+1}) {result['text']} (Score: {result['similarity']:.4f})")
    
    # Display chunk-based results
    print("\n=== Chunk-Based Results ===")
    for i, result in enumerate(chunk_results):
        # Truncate text to keep the output manageable
        truncated_text = result['text'][:150] + "..." if len(result['text']) > 150 else result['text']
        print(f"{i+1}) {truncated_text} (Score: {result['similarity']:.4f})")
    
    # Return the comparison results
    return {
        "query": query,
        "proposition_results": prop_results,
        "chunk_results": chunk_results
    }
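
Because each proposition stores its `source_chunk_id` in metadata, proposition hits can be traced back to the chunks they came from. The sketch below shows one way to do that; `source_chunks_for_hits` is a hypothetical helper, and the hit list is made up to match the shape returned by `similarity_search`.

```python
from typing import Dict, List

def source_chunks_for_hits(prop_results: List[Dict]) -> List[int]:
    """Return the unique source_chunk_id values of the hits, in rank order."""
    seen = set()
    ordered = []
    for hit in prop_results:
        chunk_id = hit.get("metadata", {}).get("source_chunk_id")
        if chunk_id is not None and chunk_id not in seen:
            seen.add(chunk_id)
            ordered.append(chunk_id)
    return ordered

# Made-up hits in the format produced by SimpleVectorStore.similarity_search
example_hits = [
    {"text": "Proposition A", "metadata": {"type": "proposition", "source_chunk_id": 3}, "similarity": 0.91},
    {"text": "Proposition B", "metadata": {"type": "proposition", "source_chunk_id": 1}, "similarity": 0.88},
    {"text": "Proposition C", "metadata": {"type": "proposition", "source_chunk_id": 3}, "similarity": 0.80},
]
source_ids = source_chunks_for_hits(example_hits)
print(source_ids)
```

Comparing these IDs with the `chunk_id` values in the chunk-based results gives a quick check on whether both approaches point to the same parts of the document.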

## Response Generation and Evaluation

In [15]:
import os
import google.generativeai as genai
from typing import List, Dict

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the response generator for Gemini ---
def generate_response(query: str, results: List[Dict], result_type: str = "proposition", model: str = "gemini-1.5-flash") -> str:
    """
    Generate a response based on the query and context using Gemini.

    Args:
        query (str): User query
        results (List[Dict]): Retrieved items
        result_type (str): Type of results ('proposition' or 'chunk')
        model (str): LLM model to use

    Returns:
        str: Generated response
    """
    # Combine retrieved texts into a single context string
    context = "\n\n".join([result["text"] for result in results])
    
    # System prompt to instruct the AI on how to generate the response
    system_prompt = f"""You are an AI assistant answering questions based on retrieved information.
Your answer should be based on the following {result_type}s that were retrieved from a knowledge base.
If the retrieved information doesn't answer the question, acknowledge this limitation."""

    # User prompt containing the query and the retrieved context
    user_prompt = f"""Query: {query}

Retrieved {result_type}s:
{context}

Please answer the query based on the retrieved information."""

    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        
        # Generate the response using the specified model
        response = gemini_model.generate_content(user_prompt, generation_config={"temperature": 0.2})
        
        # Return the generated response content
        return response.text
    except Exception as e:
        print(f"An error occurred during response generation: {e}")
        return "I could not generate a response due to an error."

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a query and results from a previous step
    query = "What is a major cause of homelessness?"
    results = [
        {"text": "A lack of affordable housing is a key contributing factor to homelessness."},
        {"text": "The sun is the star at the center of the Solar System."}
    ]
    
    print("Generating AI response with Gemini...")
    ai_response = generate_response(query, results, result_type="proposition")
    
    print("\nAI Response:")
    print(ai_response)

Generating AI response with Gemini...

AI Response:
Based on the provided information, a lack of affordable housing is a major contributing factor to homelessness.



## Complete End-to-End Evaluation Pipeline

In [16]:
def run_proposition_chunking_evaluation(pdf_path, test_queries, reference_answers=None):
    """
    Run a complete evaluation of proposition chunking vs standard chunking.
    
    Args:
        pdf_path (str): Path to the PDF file
        test_queries (List[str]): List of test queries
        reference_answers (List[str], optional): Reference answers for queries
        
    Returns:
        Dict: Evaluation results
    """
    print("=== Starting Proposition Chunking Evaluation ===\n")
    
    # Process document into propositions and chunks
    chunks, propositions = process_document_into_propositions(pdf_path)
    
    # Build vector stores for chunks and propositions
    chunk_store, prop_store = build_vector_stores(chunks, propositions)
    
    # Initialize a list to store results for each query
    results = []
    
    # Run tests for each query
    for i, query in enumerate(test_queries):
        print(f"\n\n=== Testing Query {i+1}/{len(test_queries)} ===")
        print(f"Query: {query}")
        
        # Get retrieval results from both chunk-based and proposition-based approaches
        retrieval_results = compare_retrieval_approaches(query, chunk_store, prop_store)
        
        # Generate responses based on the retrieved proposition-based results
        print("\nGenerating response from proposition-based results...")
        prop_response = generate_response(
            query, 
            retrieval_results["proposition_results"], 
            "proposition"
        )
        
        # Generate responses based on the retrieved chunk-based results
        print("Generating response from chunk-based results...")
        chunk_response = generate_response(
            query, 
            retrieval_results["chunk_results"], 
            "chunk"
        )
        
        # Get reference answer if available
        reference = None
        if reference_answers and i < len(reference_answers):
            reference = reference_answers[i]
        
        # Evaluate the generated responses
        print("\nEvaluating responses...")
        evaluation = evaluate_responses(query, prop_response, chunk_response, reference)
        
        # Compile results for the current query
        query_result = {
            "query": query,
            "proposition_results": retrieval_results["proposition_results"],
            "chunk_results": retrieval_results["chunk_results"],
            "proposition_response": prop_response,
            "chunk_response": chunk_response,
            "reference_answer": reference,
            "evaluation": evaluation
        }
        
        # Append the results to the overall results list
        results.append(query_result)
        
        # Print the responses and evaluation for the current query
        print("\n=== Proposition-Based Response ===")
        print(prop_response)
        
        print("\n=== Chunk-Based Response ===")
        print(chunk_response)
        
        print("\n=== Evaluation ===")
        print(evaluation)
    
    # Generate overall analysis of the evaluation
    print("\n\n=== Generating Overall Analysis ===")
    overall_analysis = generate_overall_analysis(results)
    print("\n" + overall_analysis)
    
    # Return the evaluation results, overall analysis, and counts of propositions and chunks
    return {
        "results": results,
        "overall_analysis": overall_analysis,
        "proposition_count": len(propositions),
        "chunk_count": len(chunks)
    }

In [17]:
def generate_overall_analysis(results):
    """
    Generate an overall analysis of proposition vs chunk approaches.
    
    Args:
        results (List[Dict]): Results from each test query
        
    Returns:
        str: Overall analysis
    """
    # System prompt to instruct the AI on how to generate the overall analysis
    system_prompt = """You are an expert at evaluating information retrieval systems.
    Based on multiple test queries, provide an overall analysis comparing proposition-based retrieval 
    to chunk-based retrieval for RAG (Retrieval-Augmented Generation) systems.

    Focus on:
    1. When proposition-based retrieval performs better
    2. When chunk-based retrieval performs better
    3. The overall strengths and weaknesses of each approach
    4. Recommendations for when to use each approach"""

    # Create a summary of evaluations for each query
    evaluations_summary = ""
    for i, result in enumerate(results):
        evaluations_summary += f"Query {i+1}: {result['query']}\n"
        evaluations_summary += f"Evaluation Summary: {result['evaluation'][:200]}...\n\n"

    # User prompt containing the summary of evaluations
    user_prompt = f"""Based on the following evaluations of proposition-based vs chunk-based retrieval across {len(results)} queries, 
    provide an overall analysis comparing these two approaches:

    {evaluations_summary}

    Please provide a comprehensive analysis on the relative strengths and weaknesses of proposition-based 
    and chunk-based retrieval for RAG systems."""

    # Generate the overall analysis using the Gemini model configured for this notebook
    gemini_model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=system_prompt)
    response = gemini_model.generate_content(
        user_prompt,
        generation_config={"temperature": 0}
    )
    
    # Return the generated analysis text
    return response.text
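
The summary loop in `generate_overall_analysis` slices each evaluation to 200 characters, which only works when the evaluation is a string. A minimal sketch of a more defensive summary builder follows. The name `summarize_evaluations` and its `max_chars` parameter are hypothetical and not part of the original notebook; the sketch assumes an evaluation may be either a string or a JSON-serializable dict.

```python
import json
from typing import Any, Dict, List


def summarize_evaluations(results: List[Dict[str, Any]], max_chars: int = 200) -> str:
    """Build a per-query summary, truncating each evaluation to max_chars."""
    lines = []
    for i, result in enumerate(results, start=1):
        evaluation = result.get("evaluation", "")
        # Serialize non-string evaluations so that slicing never raises TypeError
        if not isinstance(evaluation, str):
            evaluation = json.dumps(evaluation, ensure_ascii=False)
        # Append an ellipsis only when text was actually cut off
        snippet = evaluation[:max_chars]
        if len(evaluation) > max_chars:
            snippet += "..."
        lines.append(f"Query {i}: {result.get('query', '')}")
        lines.append(f"Evaluation Summary: {snippet}")
        lines.append("")
    return "\n".join(lines)


# Illustrative inputs only: one string evaluation and one dict evaluation
example_results = [
    {"query": "What is proposition chunking?", "evaluation": "Propositions were more precise."},
    {"query": "Why is a single number inadequate?", "evaluation": {"analysis": "placeholder"}},
]
print(summarize_evaluations(example_results))
```

The returned string could be interpolated into `user_prompt` in place of `evaluations_summary`.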

## Evaluation of Proposition Chunking

In [21]:
import os
import fitz
import numpy as np
import json
import re
import google.generativeai as genai
from typing import List, Dict, Any, Tuple

# --- Gemini API Configuration ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- Helper functions (must be defined before being called) ---
def extract_text_from_pdf(pdf_path: str) -> str:
    all_text = ""
    try:
        with fitz.open(pdf_path) as doc:
            for page in doc:
                all_text += page.get_text("text")
    except Exception as e:
        print(f"Error reading PDF: {e}")
    return all_text

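def chunk_text(text: str, n: int, overlap: int) -> List[str]:
    # Split text into chunks of n characters; consecutive chunks share `overlap` characters
    if n <= overlap:
        raise ValueError("n must be greater than overlap")
    chunks = []
    for i in range(0, len(text), n - overlap):
        chunks.append(text[i:i + n])
    return chunks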

def create_embeddings(texts: "str | List[str]", model: str = "models/embedding-001") -> Any:
    # Accepts a single string or a list of strings; returns [] if the API call fails
    try:
        response = genai.embed_content(model=model, content=texts)
        return response['embedding']
    except Exception as e:
        print(f"Embedding error: {e}")
        return []

class SimpleVectorStore:
    def __init__(self):
        self.vectors = []
        self.texts = []

    def add_documents(self, documents, vectors):
        self.texts.extend(documents)
        self.vectors.extend([np.array(v) for v in vectors])

    def search(self, query_embedding, top_k=5):
        query_vector = np.array(query_embedding)
        query_norm = np.linalg.norm(query_vector)
        similarities = []
        for vec in self.vectors:
            # Cosine similarity; treated as 0 when either vector has zero norm
            denom = query_norm * np.linalg.norm(vec)
            similarities.append(np.dot(query_vector, vec) / denom if denom != 0 else 0)
        top_indices = np.argsort(similarities)[::-1][:top_k]
        return [{"text": self.texts[i], "score": similarities[i]} for i in top_indices]

def generate_response(query: str, context: str, model: str = "gemini-1.5-flash") -> str:
    system_prompt = "You are a helpful assistant. Answer the user's question based only on the provided context."
    user_prompt = f"Context:\n{context}\n\nQuestion: {query}"
    try:
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        response = gemini_model.generate_content(user_prompt)
        return response.text
    except Exception as e:
        print(f"Response generation error: {e}")
        return "Error"

def get_propositions(chunk: Dict) -> List[str]:
    # This function would call an LLM to generate propositions.
    # For this example, we'll return a stub.
    return [f"Proposition for {chunk['text'][:20]}...", "Another proposition..."]

# Placeholder evaluator: replace the body with an LLM-based comparison of the two responses
def evaluate_responses(query: str, prop_response: str, chunk_response: str, reference: str) -> Dict[str, str]:
    # Placeholder for the actual evaluation logic
    print("Running placeholder evaluation...")
    return {"analysis": "This is a placeholder for the evaluation analysis."}

def run_proposition_chunking_evaluation(pdf_path: str, test_queries: List[str], reference_answers: List[str]) -> Dict:
    evaluation_results = {}
    
    # Process the PDF
    full_text = extract_text_from_pdf(pdf_path)
    chunks = chunk_text(full_text, n=800, overlap=0)
    
    # Run the evaluation loop for each query
    for i, query in enumerate(test_queries):
        reference = reference_answers[i] if i < len(reference_answers) else ""
        
        # Retrieval and generation are stubbed out in this cell, so these values are placeholders
        retrieval_results = {"proposition_results": [], "chunk_results": []}
        prop_response = "Placeholder proposition response."
        chunk_response = "Placeholder chunk response."
        
        evaluation = evaluate_responses(query, prop_response, chunk_response, reference)
        
        query_result = {
            "query": query,
            "proposition_results": retrieval_results["proposition_results"],
            "chunk_results": retrieval_results["chunk_results"],
            "proposition_response": prop_response,
            "chunk_response": chunk_response,
            "evaluation": evaluation
        }
        
        evaluation_results[f"query_{i+1}"] = query_result
    
    return evaluation_results

# --- Main Logic ---
pdf_path = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf"
test_queries = ["Why is a single number inadequate for understanding homelessness?"]
reference_answers = ["Section “How many homeless people are there?” and discussion of stock, flow, and prevalence figures"]

# Run the evaluation
evaluation_results = run_proposition_chunking_evaluation(
    pdf_path=pdf_path,
    test_queries=test_queries,
    reference_answers=reference_answers
)

print(evaluation_results)

Running placeholder evaluation...
{'query_1': {'query': 'Why is a single number inadequate for understanding homelessness?', 'proposition_results': [], 'chunk_results': [], 'proposition_response': 'Placeholder proposition response.', 'chunk_response': 'Placeholder chunk response.', 'evaluation': {'analysis': 'This is a placeholder for the evaluation analysis.'}}}
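
The output above contains only placeholders because the evaluation cell stubs out retrieval. As a rough offline illustration of how `proposition_results` and `chunk_results` could be populated, the sketch below ranks texts by cosine similarity against a query. Note the assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for `create_embeddings` (it is not a real embedding model), `retrieve` is a hypothetical helper, and the sample propositions are made up for illustration rather than taken from the PDF.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for create_embeddings: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, texts: List[str], top_k: int = 2) -> List[Dict]:
    """Return the top_k texts most similar to the query, highest score first."""
    q = toy_embed(query)
    scored = [{"text": t, "score": cosine(q, toy_embed(t))} for t in texts]
    return sorted(scored, key=lambda r: r["score"], reverse=True)[:top_k]


# Illustrative data only; not extracted from the PDF
propositions = [
    "Stock figures count people homeless on a single night.",
    "Flow figures count people entering homelessness over a period.",
    "PyMuPDF extracts text from PDF pages.",
]
# A single coarse chunk containing all three statements
chunks = [" ".join(propositions)]

query = "What do flow figures count?"
retrieval_results = {
    "proposition_results": retrieve(query, propositions),
    "chunk_results": retrieve(query, chunks),
}
print(retrieval_results["proposition_results"][0]["text"])
```

With real embeddings, the same dictionary shape could be passed into the per-query result in `run_proposition_chunking_evaluation` in place of the empty lists.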
