# Document Augmentation RAG with Question Generation

This notebook implements an enhanced RAG approach that augments documents through question generation. For each text chunk we generate questions the chunk can answer and index them alongside the chunk itself, which gives a user query more ways to match relevant content and, in turn, should lead to better responses from the language model.

In this implementation, we follow these steps:

1. **Data Ingestion**: Extract text from a PDF file.
2. **Chunking**: Split the text into manageable chunks.
3. **Question Generation**: Generate relevant questions for each chunk.
4. **Embedding Creation**: Create embeddings for both chunks and generated questions.
5. **Vector Store Creation**: Build a simple vector store using NumPy.
6. **Semantic Search**: Retrieve relevant chunks and questions for user queries.
7. **Response Generation**: Generate answers based on retrieved content.
8. **Evaluation**: Assess the quality of the generated responses.

## Setting Up the Environment
We begin by importing the necessary libraries and configuring the Gemini API client.

In [1]:
import fitz
import os
import numpy as np
import json
import google.generativeai as genai
import re
from tqdm import tqdm

# --- Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set. Please set it.")
genai.configure(api_key=GOOGLE_API_KEY)

## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [4]:
import fitz  # pip install PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the entire PDF.
    """
    # Open the PDF file
    doc = fitz.open(pdf_path)
    all_text = []

    # Iterate through each page in the PDF
    for page in doc:
        all_text.append(page.get_text("text"))

    doc.close()
    return "\n".join(all_text)

# Example usage:
pdf_file = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf"
text = extract_text_from_pdf(pdf_file)
print(text[:500])  # Print the first 500 characters to verify


19
Defining and Measuring Homelessness
Volker Busch-Geertsema
GISS, Germany
>> Abstract_ Substantial progress has been made at EU level on defining home-
lessness. The European Typology on Homelessness and Housing Exclusion 
(ETHOS) is widely accepted in almost all European countries (and beyond) as 
a useful conceptual framework and almost everywhere definitions at national 
level (though often not identical with ETHOS) are discussed in relation to this 
typology. The development and some of th


In [4]:
# --- 1. PDF text extraction function (same as in the previous cell) ---
def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.
    """
    doc = fitz.open(pdf_path)
    all_text = []
    for page in doc:
        all_text.append(page.get_text("text"))
    doc.close()
    return "\n".join(all_text)

# --- 2. Gemini API Configuration ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 3. Function to summarize text using Gemini ---
def summarize_with_gemini(text_to_summarize: str) -> str:
    """
    Uses the Gemini API to generate a summary of the provided text.
    """
    try:
        model = genai.GenerativeModel('gemini-1.5-flash')
        prompt = f"Please provide a concise summary of the following text:\n\n{text_to_summarize}"
        response = model.generate_content(prompt)
        return response.text
    except Exception as e:
        print(f"Error during API call: {e}")
        return "Failed to generate summary."

# --- 4. Main Logic ---
if __name__ == "__main__":
    pdf_file = "/Users/kekunkoya/Desktop/770 Google /Homelessness.pdf"

    if not os.path.exists(pdf_file):
        # Raise instead of calling exit(), which can stop the notebook kernel
        raise FileNotFoundError(f"PDF file not found at '{pdf_file}'")

    # Step A: Extract text from the PDF using your function
    print("Extracting text from PDF...")
    text = extract_text_from_pdf(pdf_file)
    print("Text extraction complete.")

    # Step B: Pass the extracted text to Gemini for summarization
    if text:
        print("\nGenerating summary with Gemini...")
        summary = summarize_with_gemini(text)
        print("\nSummary:")
        print(summary)
    else:
        print("No text was extracted from the PDF.")

Extracting text from PDF...
Text extraction complete.

Generating summary with Gemini...

Summary:
This chapter examines the progress and challenges in defining and measuring homelessness in Europe.  While the European Typology on Homelessness and Housing Exclusion (ETHOS) provides a widely accepted framework, inconsistencies remain in national definitions.  A simplified version, "ETHOS light," aims for greater harmonization, focusing on easily comparable categories.  The chapter argues that a single homelessness figure is insufficient;  multiple indicators (point-in-time, annual prevalence, inflow/outflow) are needed to understand and monitor the issue effectively.  Various data collection methods (surveys, registers, censuses) are discussed, highlighting the need for improved data collection strategies, political commitment, and transnational cooperation to achieve comparable European-wide data.  Future research should focus on clarifying ambiguous categories (long-term homelessness,

## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [6]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
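    if overlap >= n:
        # A zero or negative step would make range() fail or yield no chunks
        raise ValueError("overlap must be smaller than n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):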
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
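
To make the overlap concrete, here is a minimal, self-contained sketch that restates the same sliding-window logic on a short alphabet string. The sample string and the values `n=10` and `overlap=4` are chosen purely for illustration and are not taken from the notebook's run.

```python
def chunk_text(text, n, overlap):
    """Sliding-window chunking: each chunk starts (n - overlap) characters after the last."""
    step = n - overlap
    return [text[i:i + n] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"  # illustrative input, 26 characters
chunks = chunk_text(sample, n=10, overlap=4)

for chunk in chunks:
    print(repr(chunk))
# 'abcdefghij'
# 'ghijklmnop'
# 'mnopqrstuv'
# 'stuvwxyz'
# 'yz'
```

Each chunk repeats the last four characters of the one before it. Note that the final chunk (`'yz'`) is fully contained in the previous one; on real documents such a trailing fragment could be dropped or merged, although the function above keeps it.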

## Generating Questions for Text Chunks
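This is the key enhancement over simple RAG. For each text chunk we generate questions that the chunk can answer. Because user queries are usually phrased as questions, a generated question can sit closer to a query in embedding space than the chunk's own prose does.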

In [7]:
import os
import google.generativeai as genai
import re
from typing import List

# --- 1. Gemini API Configuration ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the question generation function for Gemini ---
def generate_questions(text_chunk: str, num_questions: int = 5, model: str = "gemini-1.5-flash") -> List[str]:
    """
    Generates relevant questions that can be answered from the given text chunk.

    Args:
    text_chunk (str): The text chunk to generate questions from.
    num_questions (int): Number of questions to generate.
    model (str): The model to use for question generation.

    Returns:
    List[str]: List of generated questions.
    """
    # Define the system prompt to guide the AI's behavior
    system_prompt = "You are an expert at generating relevant questions from text. Create concise questions that can be answered using only the provided text. Focus on key information and concepts."
    
    # Define the user prompt with the text chunk and the number of questions to generate
    user_prompt = f"""
Based on the following text, generate {num_questions} different questions that can be answered using only this text:

{text_chunk}

Format your response as a numbered list of questions only, with no additional text.
"""
    
    # Generate questions using the Gemini API
    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        response = gemini_model.generate_content(user_prompt)
        
        # Extract and clean questions from the response
        questions_text = response.text.strip()
        questions = []
        
        # Extract questions using regex pattern matching
        for line in questions_text.split('\n'):
            cleaned_line = re.sub(r'^\d+\.\s*', '', line.strip())
            if cleaned_line and cleaned_line.endswith('?'):
                questions.append(cleaned_line)
        
        return questions
    except Exception as e:
        print(f"An error occurred during question generation: {e}")
        return []

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a text chunk
    sample_chunk = """
    Homelessness is a complex social problem with various contributing factors, including economic, social, and personal issues. 
    A key factor is the lack of affordable housing, which disproportionately affects low-income families and individuals.
    """
    
    print("Generating questions with Gemini...")
    # Generate questions from the text chunk
    generated_questions = generate_questions(sample_chunk, num_questions=3)
    
    # Print the generated questions
    print("\nGenerated Questions:")
    for q in generated_questions:
        print(f"- {q}")

Generating questions with Gemini...

Generated Questions:
- What is a key factor contributing to homelessness?
- What types of issues contribute to homelessness?
- Which groups are disproportionately affected by a lack of affordable housing?
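
The parsing step can be checked without calling the API. The sketch below applies the same idea (strip the list number, keep only lines ending in `?`) to a hard-coded response string. The helper name `parse_questions`, the sample response, and the extra tolerance for `1)` style numbering are assumptions added for illustration; the function above only strips the `1.` form.

```python
import re

def parse_questions(response_text):
    """Extract question lines from a numbered-list response."""
    questions = []
    for line in response_text.splitlines():
        # Remove a leading "1." or "1)" marker (the ")" form is an added assumption)
        cleaned = re.sub(r'^\d+[.)]\s*', '', line.strip())
        # Keep only lines that look like questions
        if cleaned.endswith('?'):
            questions.append(cleaned)
    return questions

# Hypothetical model output, including a preamble and a non-question line
sample_response = """Here are the questions:
1. What is a key factor contributing to homelessness?
2) Which groups are disproportionately affected?

3. This line is not a question
"""

parsed = parse_questions(sample_response)
for q in parsed:
    print("-", q)
```

Filtering on the trailing `?` is what discards the preamble and the stray statement, so a chatty response does not pollute the vector store with non-question text.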


## Creating Embeddings for Text
We generate embeddings for both text chunks and generated questions.

In [8]:
import os
import google.generativeai as genai
import numpy as np

# --- 1. Gemini API Configuration ---
# Get your API key from an environment variable
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the create_embeddings function for Gemini ---
def create_embeddings(text: str, model: str = "models/embedding-001") -> np.ndarray:
    """
    Creates an embedding for the given text using the Gemini API.

    Args:
    text (str): The input text for which embeddings are to be created.
    model (str): The embedding model to be used. Default is "models/embedding-001".

    Returns:
    np.ndarray: The embedding vector as a NumPy array.
    """
    try:
        # Create embeddings using the specified model and input text
        response = genai.embed_content(model=model, content=text)
        # The embedding is located in the 'embedding' key of the response dictionary
        return np.array(response['embedding'], dtype=np.float32)
    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return np.array([])

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    sample_text = "Homelessness is a significant social issue."

    print("Creating embedding with Gemini...")
    embedding = create_embeddings(sample_text)

    if embedding.size > 0:
        print("\nEmbedding created successfully.")
        print(f"Embedding shape: {embedding.shape}")
        print(f"First 5 values: {embedding[:5]}")
    else:
        print("\nFailed to create embedding.")

Creating embedding with Gemini...

Embedding created successfully.
Embedding shape: (768,)
First 5 values: [ 0.05044351 -0.03356895 -0.06893376 -0.04065947  0.04912253]


## Building a Simple Vector Store
We'll implement a simple vector store using NumPy.

In [9]:
import numpy as np
import google.generativeai as genai
import os
from typing import List, Dict

# A minimal in-memory vector store backed by NumPy arrays
class SimpleVectorStore:
    """
    A simple vector store implementation using NumPy.
    """
    def __init__(self):
        self.vectors = []
        self.texts = []
        self.metadata = []
    
    def add_item(self, text, embedding, metadata=None):
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})
    
    def similarity_search(self, query_embedding, k=5):
        if not self.vectors:
            return []
        
        query_vector = np.array(query_embedding)
        
        similarities = []
        for i, vector in enumerate(self.vectors):
            # Added a check to prevent division by zero
            if np.linalg.norm(query_vector) == 0 or np.linalg.norm(vector) == 0:
                similarity = 0
            else:
                similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
            similarities.append((i, similarity))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],
                "metadata": self.metadata[idx],
                "similarity": score
            })
        
        return results

# --- Gemini API Configuration ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- Helper function to get embedding from Gemini ---
def get_embedding(text: str, model: str = "models/embedding-001") -> List[float]:
    """
    Creates an embedding for a given text using the Gemini API.
    """
    try:
        response = genai.embed_content(model=model, content=text)
        return response['embedding']
    except Exception as e:
        print(f"An error occurred: {e}")
        return []

# --- Main Logic ---
if __name__ == "__main__":
    # 1. Initialize the vector store
    store = SimpleVectorStore()

    # 2. Add some sample data to the vector store
    docs = [
        "Homelessness is a complex social problem.",
        "A lack of affordable housing is a key contributing factor.",
        "The sun is the center of our solar system." # Unrelated text
    ]

    print("Populating vector store with Gemini embeddings...")
    for i, doc in enumerate(docs):
        embedding = get_embedding(doc)
        if embedding:
            store.add_item(doc, embedding, {"id": i})

    # 3. Perform a semantic search
    query_text = "What causes homelessness?"
    query_embedding = get_embedding(query_text)
    
    if query_embedding:
        print(f"\nSearching for relevant documents for: '{query_text}'")
        results = store.similarity_search(query_embedding, k=2)

        # 4. Print the search results
        print("\nTop 2 search results:")
        for res in results:
            print(f"- Similarity: {res['similarity']:.4f}, Text: '{res['text']}'")
    else:
        print("\nFailed to get embedding for the query.")

Populating vector store with Gemini embeddings...

Searching for relevant documents for: 'What causes homelessness?'

Top 2 search results:
- Similarity: 0.8567, Text: 'Homelessness is a complex social problem.'
- Similarity: 0.7409, Text: 'A lack of affordable housing is a key contributing factor.'
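
The store above scores one vector at a time in a Python loop. As a sketch of an alternative, the same cosine similarity ranking can be computed for all stored vectors in a single matrix operation. The function name `cosine_top_k` and the toy 2-D vectors are assumptions for illustration; real embeddings from the API have 768 dimensions, as shown earlier.

```python
import numpy as np

def cosine_top_k(query, vectors, k=2):
    """Return (index, cosine similarity) for the k stored vectors closest to the query."""
    matrix = np.asarray(vectors, dtype=np.float32)
    q = np.asarray(query, dtype=np.float32)
    # Product of norms for every (stored vector, query) pair
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(q)
    # Avoid dividing by zero; zero-norm vectors get similarity 0
    safe = np.where(norms == 0, 1.0, norms)
    scores = np.where(norms == 0, 0.0, (matrix @ q) / safe)
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors: same direction, 0.6 cosine, orthogonal, and a zero vector
toy_vectors = [[1, 0], [0.6, 0.8], [0, 1], [0, 0]]
top = cosine_top_k([1, 0], toy_vectors, k=2)
print(top)
```

For a few hundred items the loop is perfectly adequate; the vectorized form mainly pays off as the store grows.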


## Processing Documents with Question Augmentation
Now we'll put everything together to process documents, generate questions, and build our augmented vector store.

In [11]:


from typing import List, Tuple  # Tuple is needed for process_document's return annotation

# --- 1. Gemini API Configuration ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Helper functions ---
# Compact versions of the PDF extraction, chunking, vector store, embedding,
# and question generation helpers from the previous sections, repeated here
# so the full pipeline is visible in one place.

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extracts text from a PDF file."""
    doc = fitz.open(pdf_path)
    all_text = [page.get_text("text") for page in doc]
    doc.close()
    return "\n".join(all_text)

def chunk_text(text: str, n: int, overlap: int) -> List[str]:
    """Chunks the given text into segments of n characters with overlap."""
    chunks = []
    for i in range(0, len(text), n - overlap):
        chunks.append(text[i:i + n])
    return chunks

class SimpleVectorStore:
    """A simple vector store implementation using NumPy."""
    def __init__(self):
        self.vectors = []
        self.texts = []
        self.metadata = []
    
    def add_item(self, text, embedding, metadata=None):
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})
    
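    def similarity_search(self, query_embedding, k=5):
        """Returns the k stored items most similar to the query (cosine similarity)."""
        if not self.vectors:
            return []
        query_vector = np.array(query_embedding)
        scored = []
        for idx, vector in enumerate(self.vectors):
            # Guard against zero-norm vectors to avoid division by zero
            denom = np.linalg.norm(query_vector) * np.linalg.norm(vector)
            scored.append((idx, float(np.dot(query_vector, vector) / denom) if denom else 0.0))
        scored.sort(key=lambda x: x[1], reverse=True)
        return [{"text": self.texts[i], "metadata": self.metadata[i], "similarity": s} for i, s in scored[:k]]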

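from typing import List, Union  # "str or List[str]" evaluates to just str; Union states the intent

def create_embeddings(text: Union[str, List[str]], model: str = "models/embedding-001") -> Union[List[float], List[List[float]]]: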
    """Creates embeddings for text using the Gemini API."""
    try:
        response = genai.embed_content(model=model, content=text)
        return response['embedding']
    except Exception as e:
        print(f"Embedding error: {e}")
        return []

def generate_questions(text_chunk: str, num_questions: int = 5, model: str = "gemini-1.5-flash") -> List[str]:
    """Generates relevant questions from a text chunk using Gemini."""
    system_prompt = "You are an expert at generating relevant questions from text. Create concise questions that can be answered using only the provided text. Focus on key information and concepts."
    user_prompt = f"Based on the following text, generate {num_questions} different questions:\n\n{text_chunk}\n\nFormat your response as a numbered list."
    try:
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        response = gemini_model.generate_content(user_prompt)
        # Assuming the response is a numbered list, we'll parse it
        return [re.sub(r'^\d+\.\s*', '', line.strip()) for line in response.text.split('\n') if line.strip()]
    except Exception as e:
        print(f"Question generation error: {e}")
        return []

# --- 3. The main processing function (revised) ---
def process_document(pdf_path: str, chunk_size: int = 1000, chunk_overlap: int = 200, questions_per_chunk: int = 5) -> Tuple[List[str], SimpleVectorStore]:
    """
    Process a document with question augmentation using Gemini.
    """
    print("Extracting text from PDF...")
    extracted_text = extract_text_from_pdf(pdf_path)
    
    print("Chunking text...")
    text_chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
    print(f"Created {len(text_chunks)} text chunks")
    
    vector_store = SimpleVectorStore()
    
    print("Processing chunks and generating questions...")
    for i, chunk in enumerate(tqdm(text_chunks, desc="Processing Chunks")):
        # Create embedding for the chunk itself
        chunk_embedding = create_embeddings(chunk)
        if not chunk_embedding:
            # Skip chunks whose embedding request failed
            continue

        vector_store.add_item(
            text=chunk,
            embedding=chunk_embedding,
            metadata={"type": "chunk", "index": i}
        )
        
        # Generate questions for this chunk
        questions = generate_questions(chunk, num_questions=questions_per_chunk)
        
        # Create embeddings for each question and add to vector store
        if questions:
            question_embeddings = create_embeddings(questions)
            if question_embeddings:
                for j, question in enumerate(questions):
                    vector_store.add_item(
                        text=question,
                        embedding=question_embeddings[j],
                        metadata={"type": "question", "chunk_index": i, "original_chunk": chunk}
                    )
    
    return text_chunks, vector_store

## Extracting and Processing the Document

In [12]:
# Define the path to the PDF file
pdf_path = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf"

# Process the document (extract text, create chunks, generate questions, build vector store)
text_chunks, vector_store = process_document(
    pdf_path, 
    chunk_size=1000, 
    chunk_overlap=200, 
    questions_per_chunk=3
)

print(f"Vector store contains {len(vector_store.texts)} items")

Extracting text from PDF...
Chunking text...
Created 65 text chunks
Processing chunks and generating questions...


Processing Chunks: 100%|██████████| 65/65 [01:30<00:00,  1.40s/it]

Vector store contains 260 items
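
The item count can be sanity-checked with simple arithmetic: the store holds one entry per chunk plus one entry per generated question. The helper below is a hypothetical illustration and assumes that every chunk was embedded successfully and produced exactly the requested number of questions.

```python
def expected_store_size(num_chunks: int, questions_per_chunk: int) -> int:
    """One entry for each chunk, plus one for each question generated from it."""
    return num_chunks * (1 + questions_per_chunk)

size = expected_store_size(num_chunks=65, questions_per_chunk=3)
print(size)  # 260
```

A smaller number than expected would suggest that some embedding or question generation calls failed; a larger one would suggest that extra lines in the model's response were stored as questions.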





## Performing Semantic Search
We implement a semantic search function similar to the simple RAG implementation but adapted to our augmented vector store.

In [13]:


# --- 2. Define the create_embeddings function for Gemini ---
def create_embeddings(text: str, model: str = "models/embedding-001") -> np.ndarray:
    """
    Creates an embedding for the given text using the Gemini API.

    Args:
    text (str): The input text to be embedded.
    model (str): The embedding model to be used. Default is "models/embedding-001".

    Returns:
    np.ndarray: The embedding vector as a NumPy array.
    """
    try:
        # The Gemini API takes a single text input for this function call
        response = genai.embed_content(model=model, content=text)
        return np.array(response['embedding'], dtype=np.float32)
    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return np.array([])

# --- 3. Define the semantic_search function (revised) ---
def semantic_search(query: str, vector_store, k: int = 5) -> List[Dict]:
    """
    Performs semantic search using the query and vector store.

    Args:
    query (str): The search query.
    vector_store (SimpleVectorStore): The vector store to search in.
    k (int): Number of results to return.

    Returns:
    List[Dict]: Top k most relevant items.
    """
    # Create embedding for the query using the Gemini-compatible function
    query_embedding = create_embeddings(query)
    
    if query_embedding.size == 0:
        return []

    # Search the vector store
    results = vector_store.similarity_search(query_embedding, k=k)
    
    return results

# --- 4. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a SimpleVectorStore class and populate it
    class SimpleVectorStore:
        def __init__(self):
            self.vectors = []
            self.texts = []
            self.metadata = []
        
        def add_item(self, text, embedding, metadata=None):
            self.vectors.append(np.array(embedding))
            self.texts.append(text)
            self.metadata.append(metadata or {})
        
        def similarity_search(self, query_embedding, k=5):
            if not self.vectors: return []
            
            query_vector = np.array(query_embedding)
            similarities = []
            for i, vector in enumerate(self.vectors):
                if np.linalg.norm(query_vector) == 0 or np.linalg.norm(vector) == 0:
                    similarity = 0
                else:
                    similarity = np.dot(query_vector, vector) / (np.linalg.norm(query_vector) * np.linalg.norm(vector))
                similarities.append((i, similarity))
            
            similarities.sort(key=lambda x: x[1], reverse=True)
            
            results = []
            for i in range(min(k, len(similarities))):
                idx, score = similarities[i]
                results.append({"text": self.texts[idx], "metadata": self.metadata[idx], "similarity": score})
            return results

    store = SimpleVectorStore()
    docs = [
        "Homelessness is a complex social problem.",
        "A lack of affordable housing is a key contributing factor.",
        "The sun is the center of our solar system."
    ]

    print("Populating vector store with Gemini embeddings...")
    for i, doc in enumerate(docs):
        embedding = create_embeddings(doc)
        if embedding.size > 0:
            store.add_item(doc, embedding, {"id": i})

    query_text = "What causes homelessness?"
    
    print(f"\nSearching for relevant documents for: '{query_text}'")
    results = semantic_search(query_text, store, k=2)

    print("\nTop 2 search results:")
    for res in results:
        print(f"- Similarity: {res['similarity']:.4f}, Text: '{res['text']}'")

Populating vector store with Gemini embeddings...

Searching for relevant documents for: 'What causes homelessness?'

Top 2 search results:
- Similarity: 0.8567, Text: 'Homelessness is a complex social problem.'
- Similarity: 0.7409, Text: 'A lack of affordable housing is a key contributing factor.'


## Running a Query on the Augmented Vector Store

In [14]:
import os
import json
import numpy as np
import google.generativeai as genai
from typing import List, Dict

# --- 1. Gemini API Configuration and Helper Functions ---
# (Assumed to be defined and configured for Gemini in a separate module or above)
def get_embedding(text: str, model: str = "models/embedding-001") -> np.ndarray:
    """Creates an embedding for a text using the Gemini API."""
    response = genai.embed_content(model=model, content=text)
    return np.array(response['embedding'], dtype=np.float32)

class SimpleVectorStore:
    def __init__(self):
        self.vectors = []
        self.texts = []
        self.metadata = []
    
    def add_item(self, text, embedding, metadata=None):
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})
    
    def similarity_search(self, query_embedding, k=5):
        if not self.vectors: return []
        
        query_vector = np.array(query_embedding)
        similarities = []
        for i, vector in enumerate(self.vectors):
            # Guard against zero-norm vectors to avoid division by zero
            denom = np.linalg.norm(query_vector) * np.linalg.norm(vector)
            similarity = np.dot(query_vector, vector) / denom if denom else 0.0
            similarities.append((i, similarity))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({"text": self.texts[idx], "metadata": self.metadata[idx], "similarity": score})
        return results

def semantic_search(query: str, vector_store, k: int = 5) -> List[Dict]:
    """Performs semantic search using the Gemini-compatible vector store."""
    query_embedding = get_embedding(query)
    if query_embedding.size == 0: return []
    results = vector_store.similarity_search(query_embedding, k=k)
    return results

# --- 2. Main Logic ---
if __name__ == "__main__":
    # Simulate a populated vector store for the example
    vector_store = SimpleVectorStore()
    doc_chunk = "Homelessness is a complex social problem. A key factor is the lack of affordable housing."
    q_chunk = "What is a key factor in homelessness?"
    vector_store.add_item(doc_chunk, get_embedding(doc_chunk), {"type": "chunk", "index": 0})
    # Store the source chunk with the question, as process_document does
    vector_store.add_item(q_chunk, get_embedding(q_chunk), {"type": "question", "chunk_index": 0, "original_chunk": doc_chunk})

    # In the full workflow the validation queries are loaded from a JSON file;
    # a single inline query stands in for that file here.
    data = [{"question": "What are the main contributing factors to homelessness?"}]

    # Extract the first query from the validation data
    query = data[0]['question']

    # Perform semantic search to find relevant content
    search_results = semantic_search(query, vector_store, k=5)

    print("Query:", query)
    print("\nSearch Results:")

    # Organize results by type
    chunk_results = []
    question_results = []

    for result in search_results:
        if result["metadata"]["type"] == "chunk":
            chunk_results.append(result)
        else:
            question_results.append(result)

    # Print chunk results first
    print("\nRelevant Document Chunks:")
    for i, result in enumerate(chunk_results):
        print(f"Context {i + 1} (similarity: {result['similarity']:.4f}):")
        print(result["text"][:300] + "...")
        print("=====================================")

    # Then print question matches
    print("\nMatched Questions:")
    for i, result in enumerate(question_results):
        print(f"Question {i + 1} (similarity: {result['similarity']:.4f}):")
        print(result["text"])
        chunk_idx = result["metadata"]["chunk_index"]
        print(f"From chunk {chunk_idx}")
        print("=====================================")

Query: What are the main contributing factors to homelessness?

Search Results:

Relevant Document Chunks:
Context 1 (similarity: 0.8682):
Homelessness is a complex social problem. A key factor is the lack of affordable housing....

Matched Questions:
Question 1 (similarity: 0.9502):
What is a key factor in homelessness?
From chunk 0
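
Since chunk hits and question hits both point back to chunks, one possible way to compare them is to collapse the mixed results into a single score per chunk. The sketch below keeps the best similarity seen for each chunk. This aggregation is an illustrative assumption, not a step the notebook performs; the function name `best_score_per_chunk` and the third result entry are hypothetical, while the first two entries reuse the similarities printed above.

```python
def best_score_per_chunk(search_results):
    """Map each chunk index to the highest similarity among its chunk and question hits."""
    best = {}
    for result in search_results:
        meta = result["metadata"]
        # Chunk entries carry "index"; question entries carry "chunk_index"
        idx = meta["index"] if meta["type"] == "chunk" else meta["chunk_index"]
        best[idx] = max(best.get(idx, float("-inf")), result["similarity"])
    # Highest-scoring chunks first
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

results = [
    {"text": "What is a key factor in homelessness?",
     "metadata": {"type": "question", "chunk_index": 0}, "similarity": 0.9502},
    {"text": "Homelessness is a complex social problem. ...",
     "metadata": {"type": "chunk", "index": 0}, "similarity": 0.8682},
    {"text": "An unrelated chunk.",  # hypothetical extra entry
     "metadata": {"type": "chunk", "index": 1}, "similarity": 0.70},
]

ranking = best_score_per_chunk(results)
print(ranking)  # [(0, 0.9502), (1, 0.7)]
```

Here chunk 0 is ranked by its question match (0.9502) rather than its direct match (0.8682), which illustrates how a generated question can surface a chunk more strongly than the chunk text alone.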


## Generating Context for Response
Now we prepare the context by combining information from relevant chunks and questions.

In [15]:
def prepare_context(search_results):
    """
    Prepares a unified context from search results for response generation.

    Args:
    search_results (List[Dict]): Results from semantic search.

    Returns:
    str: Combined context string.
    """
    # Extract unique chunks referenced in the results
    chunk_indices = set()
    context_chunks = []
    
    # First add direct chunk matches
    for result in search_results:
        if result["metadata"]["type"] == "chunk":
            chunk_indices.add(result["metadata"]["index"])
            context_chunks.append(f"Chunk {result['metadata']['index']}:\n{result['text']}")
    
    # Then add chunks referenced by questions
    for result in search_results:
        if result["metadata"]["type"] == "question":
            chunk_idx = result["metadata"]["chunk_index"]
            if chunk_idx not in chunk_indices:
                chunk_indices.add(chunk_idx)
                context_chunks.append(f"Chunk {chunk_idx} (referenced by question '{result['text']}'):\n{result['metadata']['original_chunk']}")
    
    # Combine all context chunks
    full_context = "\n\n".join(context_chunks)
    return full_context
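
The deduplication behaviour is easiest to see on a small, hand-built result list. The sketch below restates `prepare_context` compactly so it runs on its own; the toy results, texts, and similarity values are made up for illustration.

```python
from typing import Dict, List

def prepare_context(search_results: List[Dict]) -> str:
    """Compact restatement: direct chunk hits first, then chunks reached only via questions."""
    seen = set()
    parts = []
    for r in search_results:
        if r["metadata"]["type"] == "chunk":
            idx = r["metadata"]["index"]
            seen.add(idx)
            parts.append(f"Chunk {idx}:\n{r['text']}")
    for r in search_results:
        if r["metadata"]["type"] == "question":
            idx = r["metadata"]["chunk_index"]
            if idx not in seen:
                seen.add(idx)
                parts.append(
                    f"Chunk {idx} (referenced by question '{r['text']}'):\n"
                    f"{r['metadata']['original_chunk']}"
                )
    return "\n\n".join(parts)

# Hypothetical results: chunk 0 is hit directly and via a question, chunk 1 only via a question
toy_results = [
    {"text": "Q about chunk 0?", "similarity": 0.95,
     "metadata": {"type": "question", "chunk_index": 0, "original_chunk": "Chunk zero text."}},
    {"text": "Chunk zero text.", "similarity": 0.87,
     "metadata": {"type": "chunk", "index": 0}},
    {"text": "Q about chunk 1?", "similarity": 0.80,
     "metadata": {"type": "question", "chunk_index": 1, "original_chunk": "Chunk one text."}},
]

context = prepare_context(toy_results)
print(context)
```

Chunk 0 appears once even though two results point to it, and chunk 1 is pulled in solely through its matching question. This only works when question entries carry an `original_chunk` field in their metadata, as `process_document` stores them.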

## Generating a Response Based on Retrieved Chunks


In [15]:
import os
import json
import numpy as np
import google.generativeai as genai
from typing import List, Dict

# --- 1. Gemini API Configuration and Helper Functions ---
# (Assumed to be defined and configured for Gemini in a separate module or above)
def get_embedding(text: str, model: str = "models/embedding-001") -> np.ndarray:
    """Creates an embedding for a text using the Gemini API."""
    response = genai.embed_content(model=model, content=text)
    return np.array(response['embedding'], dtype=np.float32)

class SimpleVectorStore:
    def __init__(self):
        self.vectors = []
        self.texts = []
        self.metadata = []
    
    def add_item(self, text, embedding, metadata=None):
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})
    
    def similarity_search(self, query_embedding, k=5):
        if not self.vectors: return []
        
        query_vector = np.array(query_embedding)
        similarities = []
        for i, vector in enumerate(self.vectors):
            # Guard against zero-norm vectors to avoid division by zero
            denom = np.linalg.norm(query_vector) * np.linalg.norm(vector)
            similarity = np.dot(query_vector, vector) / denom if denom else 0.0
            similarities.append((i, similarity))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({"text": self.texts[idx], "metadata": self.metadata[idx], "similarity": score})
        return results

def semantic_search(query: str, vector_store, k: int = 5) -> List[Dict]:
    """Performs semantic search using the Gemini-compatible vector store."""
    query_embedding = get_embedding(query)
    if query_embedding.size == 0: return []
    results = vector_store.similarity_search(query_embedding, k=k)
    return results

# --- 2. Your original prepare_context function ---
def prepare_context(search_results: List[Dict]) -> str:
    """
    Prepares a unified context from search results for response generation.
    """
    chunk_indices = set()
    context_chunks = []
    
    for result in search_results:
        if result["metadata"]["type"] == "chunk":
            chunk_indices.add(result["metadata"]["index"])
            context_chunks.append(f"Chunk {result['metadata']['index']}:\n{result['text']}")
    
    for result in search_results:
        if result["metadata"]["type"] == "question":
            chunk_idx = result["metadata"]["chunk_index"]
            if chunk_idx not in chunk_indices:
                chunk_indices.add(chunk_idx)
                context_chunks.append(f"Chunk {chunk_idx} (referenced by question '{result['text']}'):\n{result['metadata']['original_chunk']}")
    
    full_context = "\n\n".join(context_chunks)
    return full_context

# --- 3. Main Logic ---
if __name__ == "__main__":
    # Simulate a populated vector store with both chunks and questions
    vector_store = SimpleVectorStore()
    chunk1_text = "Homelessness is a complex social issue with many contributing factors."
    chunk2_text = "The lack of affordable housing is a primary cause of homelessness."
    q1_text = "What is a major cause of homelessness?"
    
    chunk1_embedding = get_embedding(chunk1_text)
    chunk2_embedding = get_embedding(chunk2_text)
    q1_embedding = get_embedding(q1_text)

    vector_store.add_item(chunk1_text, chunk1_embedding, {"type": "chunk", "index": 0})
    vector_store.add_item(chunk2_text, chunk2_embedding, {"type": "chunk", "index": 1})
    vector_store.add_item(q1_text, q1_embedding, {"type": "question", "chunk_index": 1, "original_chunk": chunk2_text})

    query = "What are the key drivers of homelessness?"
    
    # Perform semantic search to find relevant content
    search_results = semantic_search(query, vector_store, k=5)
    
    # Prepare the context using your function
    prepared_context = prepare_context(search_results)
    
    print("Query:", query)
    print("\nPrepared context for LLM:")
    print(prepared_context)

Query: What are the key drivers of homelessness?

Prepared context for LLM:
Chunk 0:
Homelessness is a complex social issue with many contributing factors.

Chunk 1:
The lack of affordable housing is a primary cause of homelessness.
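
In the output above, the generated question points at chunk 1, which was already retrieved directly, so it adds nothing new to the context. The sketch below illustrates the other case, where a question is the only route to a chunk. It needs no API call; `build_context` is a hypothetical, compact stand-in for `prepare_context`, and the toy results are invented for illustration.

```python
from typing import Dict, List


def build_context(search_results: List[Dict]) -> str:
    """Direct chunk hits first, then chunks that were reached only through a question."""
    seen = set()
    parts = []
    for r in search_results:
        if r["metadata"]["type"] == "chunk":
            seen.add(r["metadata"]["index"])
            parts.append(f"Chunk {r['metadata']['index']}:\n{r['text']}")
    for r in search_results:
        if r["metadata"]["type"] == "question":
            idx = r["metadata"]["chunk_index"]
            if idx not in seen:  # skip chunks that are already part of the context
                seen.add(idx)
                parts.append(
                    f"Chunk {idx} (referenced by question '{r['text']}'):\n"
                    f"{r['metadata']['original_chunk']}"
                )
    return "\n\n".join(parts)


# Hand-built results: chunk 0 is a direct hit, chunk 1 is reached only via a question
toy_results = [
    {"text": "Homelessness is a complex social issue.",
     "metadata": {"type": "chunk", "index": 0}},
    {"text": "What is a major cause of homelessness?",
     "metadata": {"type": "question", "chunk_index": 1,
                  "original_chunk": "The lack of affordable housing is a primary cause of homelessness."}},
    {"text": "Why is homelessness complex?",
     "metadata": {"type": "question", "chunk_index": 0,
                  "original_chunk": "Homelessness is a complex social issue."}},
]

toy_context = build_context(toy_results)
print(toy_context)
```

The second question is dropped because chunk 0 is already present, which keeps the context free of repeated passages.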


## Generating and Displaying the Response

In [17]:
# Prepare context from search results
context = prepare_context(search_results)

# Generate response
response_text = generate_response(query, context)

print("\nQuery:", query)
print("\nResponse:")
print(response_text)


Query: What is the ETHOS typology?

Response:
The ETHOS typology, or European Typology on Homelessness and Housing Exclusion, is a conceptual framework developed to define and measure homelessness and housing exclusion in Europe. It categorizes various living situations into four main conceptual categories: roofless, houseless, insecure housing, and inadequate housing. The most recent version includes thirteen operational categories and twenty-four different living situations. It aims to improve the comparability of homelessness definitions and data across different EU countries.


## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.

In [16]:

# --- 2. Define the evaluation function for Gemini ---
def evaluate_response(query: str, response: str, reference_answer: str, model: str = "gemini-1.5-flash") -> str:
    """
    Evaluates the AI response against a reference answer using Gemini.
    
    Args:
    query (str): The user's question.
    response (str): The AI-generated response.
    reference_answer (str): The reference/ideal answer.
    model (str): Model to use for evaluation.
    
    Returns:
    str: Evaluation feedback.
    """
    # Define the system prompt for the evaluation system
    evaluate_system_prompt = """You are an intelligent evaluation system tasked with assessing AI responses.
    
    Compare the AI assistant's response to the true/reference answer, and evaluate based on:
    1. Factual correctness - Does the response contain accurate information?
    2. Completeness - Does it cover all important aspects from the reference?
    3. Relevance - Does it directly address the question?

    Assign a score from 0 to 1:
    - 1.0: Perfect match in content and meaning
    - 0.8: Very good, with minor omissions/differences
    - 0.6: Good, covers main points but misses some details
    - 0.4: Partial answer with significant omissions
    - 0.2: Minimal relevant information
    - 0.0: Incorrect or irrelevant

    Provide your score with justification.
    """
    
    # Create the evaluation prompt
    evaluation_prompt = f"""
    User Query: {query}
    
    AI Response:
    {response}

    Reference Answer:
    {reference_answer}
    
    Please evaluate the AI response against the reference answer.
    """

    # Generate evaluation
    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=evaluate_system_prompt)
        eval_response = gemini_model.generate_content(evaluation_prompt)
        return eval_response.text
    except Exception as e:
        print(f"An error occurred during evaluation generation: {e}")
        return "Evaluation failed due to an error."

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a query, AI response, and reference answer
    query = "What is the capital of France?"
    ai_response = "The capital of France is Paris, which is also a major European city."
    reference_answer = "Paris"

    print("Generating evaluation with Gemini...")
    # Generate the evaluation feedback
    evaluation_feedback = evaluate_response(query, ai_response, reference_answer)

    print("\nEvaluation Feedback:")
    print(evaluation_feedback)

Generating evaluation with Gemini...

Evaluation Feedback:
Score: 0.8

Justification:

1. **Factual Correctness:** The AI response correctly identifies Paris as the capital of France.  The added information about Paris being a major European city is factually accurate but not strictly necessary to answer the question.

2. **Completeness:** The AI response provides the core answer. While it includes extra information, it doesn't detract from the accuracy of the main point.  The reference answer is more concise, but the AI response isn't overly verbose.

3. **Relevance:** The AI response directly and completely addresses the user's question. The additional sentence about Paris being a major European city is relevant to the context but not crucial to answering the question itself.


The AI response is slightly more verbose than the reference answer, but the added information is accurate and doesn't make the response incorrect or less relevant.  Therefore, a score of 0.8 is appropriate.
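
The evaluator returns free text, so comparing runs requires pulling the number out of it. The following is a minimal sketch, assuming the feedback contains a line such as `Score: 0.8` as in the output above; `extract_score` is a hypothetical helper, and the model is not guaranteed to keep this format.

```python
import re
from typing import Optional


def extract_score(evaluation_text: str) -> Optional[float]:
    """Return the first 'Score: <number>' value in the evaluator output, or None if absent."""
    # \W* tolerates punctuation or markdown between the label and the number, e.g. '**Score:** 0.8'
    match = re.search(r"Score\W*([01](?:\.\d+)?)", evaluation_text)
    if match is None:
        return None
    score = float(match.group(1))
    # Reject values outside the 0 to 1 scale requested in the system prompt
    return score if 0.0 <= score <= 1.0 else None


sample_feedback = "Score: 0.8\n\nJustification:\n\n1. **Factual Correctness:** ..."
print(extract_score(sample_feedback))
print(extract_score("No numeric score given."))
```

Returning `None` instead of a default value makes missing scores visible when results are aggregated.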



## Running the Evaluation

In [20]:
# Get reference answer from validation data
reference_answer = data[0]['ideal_answer']

# Evaluate the response
evaluation = evaluate_response(query, response_text, reference_answer)

print("\nEvaluation:")
print(evaluation)


Evaluation:
Score: 1.0

Justification: The AI response accurately describes the ETHOS typology, including its full name (European Typology on Homelessness and Housing Exclusion) and its purpose as a framework for defining and measuring homelessness and housing exclusion in Europe. It correctly identifies the four main conceptual categories (roofless, houseless, insecure housing, and inadequate housing) and mentions the thirteen operational categories and twenty-four living situations. The response also highlights the aim of improving comparability across EU countries, which aligns perfectly with the reference answer. There are no significant omissions or inaccuracies, making it a perfect match in content and meaning.


In [17]:

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate data from previous steps
    with open('/Users/kekunkoya/Desktop/ISEM 770 Class Project/valh.json') as f:
        data = json.load(f)
    
    # Assuming 'response_text' is the generated AI response from a previous step
    response_text = "Homelessness is a complex social problem caused by various economic, social, and personal issues."
    query = data[0]['question']

    # Get reference answer from validation data
    reference_answer = data[0]['ideal_answer']

    # Evaluate the response
    evaluation = evaluate_response(query, response_text, reference_answer)

    print("\nEvaluation:")
    print(evaluation)


Evaluation:
Score: 0.0

Justification:

The AI's response is completely irrelevant to the question.  It discusses homelessness in general terms, but doesn't mention or even allude to the ETHOS typology, which is what the question specifically asks about.  The reference answer accurately defines the ETHOS typology, while the AI response provides entirely unrelated information. Therefore, a score of 0.0 is appropriate.



## Extracting and Chunking Text from a PDF File
Now, we load the PDF, extract text, and split it into chunks.

In [19]:
# Define the path to the PDF file
pdf_path = "/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 65

First text chunk:
19
Defining and Measuring Homelessness
Volker Busch-Geertsema
GISS, Germany
>> Abstract_ Substantial progress has been made at EU level on defining home-
lessness. The European Typology on Homelessness and Housing Exclusion 
(ETHOS) is widely accepted in almost all European countries (and beyond) as 
a useful conceptual framework and almost everywhere definitions at national 
level (though often not identical with ETHOS) are discussed in relation to this 
typology. The development and some of the remaining controversial issues 
concerning ETHOS and a reduced version of it are discussed in this chapter. 
Furthermore essential reasons and different approaches to measure home-
lessness are presented. It is argued that a single number will not be enough 
to understand homelessness and monitor progress in tackling it. More 
research and more work to improve information on homelessness at national 
levels will be needed before we can achieve compa
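
The first chunk ends mid-word ("compa") because the text is cut purely by character count. The sketch below shows one way such a splitter can work, assuming each window starts `chunk_size - overlap` characters after the previous one; `chunk_text_sketch` is a hypothetical name and may differ from the `chunk_text` function defined earlier in the notebook.

```python
from typing import List


def chunk_text_sketch(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into chunk_size-character windows that share `overlap` characters with the previous window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # distance between the starts of consecutive windows
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij" * 250  # 2,500 characters of dummy text
chunks = chunk_text_sketch(sample, 1000, 200)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the last 200 characters of one chunk reappear at the start of the next, so a sentence cut at a boundary is still intact in at least one chunk.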

## Creating Embeddings for Text Chunks
Embeddings transform text into numerical vectors, which allow for efficient similarity search.

In [22]:
import numpy as np
import google.generativeai as genai
import os
from typing import List, Union

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the create_embeddings function for Gemini ---
def create_embeddings(texts: Union[List[str], str], model: str = "models/embedding-001") -> List[np.ndarray]:
    """
    Creates embeddings for the given text or list of texts using the specified Gemini model.
    
    Args:
        texts (List[str] or str): The input text(s) for which embeddings are to be created.
        model (str): The model to be used for creating embeddings. Default is "models/embedding-001".
        
    Returns:
        List[np.ndarray]: One numpy array per input text (a single string yields a one-element list).
    """
    try:
        # Wrap a single string in a list so the response always holds one vector per input.
        # Without this, a single string yields a flat vector that the loop below would split into scalars.
        batch = [texts] if isinstance(texts, str) else texts
        response = genai.embed_content(
            model=model,
            content=batch
        )
        # response['embedding'] is a list with one embedding per input text
        return [np.array(emb, dtype=np.float32) for emb in response['embedding']]
    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return []

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate text chunks from a previous step
    text_chunks = [
        "Homelessness is a complex social problem.",
        "A lack of affordable housing is a key contributing factor."
    ]

    print("Creating embeddings with Gemini...")
    # Create embeddings for the text chunks
    embeddings = create_embeddings(text_chunks)

    if embeddings:
        print("\nEmbeddings created successfully.")
        print(f"Number of embeddings: {len(embeddings)}")
        print(f"Shape of first embedding: {embeddings[0].shape}")
    else:
        print("\nFailed to create embeddings.")

Creating embeddings with Gemini...

Embeddings created successfully.
Number of embeddings: 2
Shape of first embedding: (768,)



## Performing Semantic Search
We implement cosine similarity to find the most relevant text chunks for a user query.

In [25]:
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
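
A small worked example may make the measure more concrete. It uses hand-picked 2-D vectors rather than real embeddings, and `cosine` is a local helper with the same formula as `cosine_similarity`.

```python
import numpy as np


def cosine(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over the product of the norms."""
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))


a = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])  # parallel to a, different length
diagonal = np.array([1.0, 1.0])        # 45 degrees from a
orthogonal = np.array([0.0, 2.0])      # 90 degrees from a

print(cosine(a, same_direction))
print(cosine(a, diagonal))
print(cosine(a, orthogonal))
```

Vector length does not affect the result: only direction matters, which is why cosine similarity suits embeddings whose magnitudes vary.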

In [26]:
def semantic_search(query, text_chunks, embeddings, k=5):
    """
    Performs semantic search on the text chunks using the given query and embeddings.

    Args:
    query (str): The query for the semantic search.
    text_chunks (List[str]): A list of text chunks to search through.
    embeddings (List[np.ndarray]): A list of embeddings for the text chunks, as returned by create_embeddings.
    k (int): The number of top relevant text chunks to return. Default is 5.

    Returns:
    List[str]: A list of the top k most relevant text chunks based on the query.
    """
    # Create an embedding for the query; passing a one-element list keeps the result a full vector
    query_embeddings = create_embeddings([query])
    if not query_embeddings:
        return []  # Embedding the query failed
    query_embedding = query_embeddings[0]
    similarity_scores = []  # Initialize a list to store similarity scores

    # Calculate similarity scores between the query embedding and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        similarity_score = cosine_similarity(query_embedding, chunk_embedding)
        similarity_scores.append((i, similarity_score))  # Append the index and similarity score

    # Sort the similarity scores in descending order
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    # Get the indices of the top k most similar text chunks
    top_indices = [index for index, _ in similarity_scores[:k]]
    # Return the top k most relevant text chunks
    return [text_chunks[index] for index in top_indices]


In [5]:
import numpy as np
from typing import List, Union
import google.generativeai as genai
import os
import json

# --- Helper functions (assumed to be defined) ---
def create_embeddings(texts: Union[List[str], str], model: str = "models/embedding-001") -> List[np.ndarray]:
    # Your embedding function as defined in previous steps
    try:
        response = genai.embed_content(model=model, content=texts)
        return [np.array(emb, dtype=np.float32) for emb in response['embedding']]
    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return []

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    # Your cosine similarity function as defined in previous steps
    dot_product = np.dot(vec1, vec2)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    return dot_product / norm_product if norm_product != 0 else 0.0

# --- Corrected semantic_search function with error checking ---
def semantic_search(query: str, text_chunks: List[str], embeddings: List[np.ndarray], k: int = 5) -> List[str]:
    """
    Performs semantic search to find the most relevant text chunks.
    """
    # ⚠️ CRITICAL CHECK: The number of text chunks and embeddings must be equal.
    if len(text_chunks) != len(embeddings):
        raise ValueError(
            f"The number of text chunks ({len(text_chunks)}) must be equal to "
            f"the number of embeddings ({len(embeddings)})."
        )

    query_embedding = create_embeddings(query)[0]
    
    if query_embedding.size == 0:
        return []

    similarities = [cosine_similarity(query_embedding, emb) for emb in embeddings]
    
    top_indices = np.argsort(similarities)[-k:][::-1]
    
    text_chunks_array = np.array(text_chunks)

    return list(text_chunks_array[top_indices])

# --- Main logic (re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a full pipeline
    GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
    if not GOOGLE_API_KEY:
        print("GOOGLE_API_KEY env var not set. Exiting.")
        exit()
    genai.configure(api_key=GOOGLE_API_KEY)

    val_path = '/Users/kekunkoya/Desktop/ISEM 770 Class Project/valh.json'
    if not os.path.isfile(val_path):
        print(f"File not found at: {val_path}")
        exit()
    with open(val_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    query = data[0]['question']
    text_chunks = [
        "Homelessness is a complex social problem with various contributing factors, including economic, social, and personal issues.",
        "A key factor is the lack of affordable housing, which disproportionately affects low-income families and individuals.",
        "Social factors like family breakdown, domestic violence, and a lack of social support networks can also lead to homelessness.",
        "Personal crises, such as job loss, mental health challenges, or substance abuse, are often triggers for losing housing."
    ]
    
    # Create embeddings for the simulated text chunks
    print("Creating embeddings...")
    embeddings = create_embeddings(text_chunks)
    if not embeddings:
        print("Failed to create embeddings.")
        exit()
    
    # Perform semantic search to find the top 2 most relevant text chunks
    print(f"\nSearching for chunks relevant to: '{query}'")
    top_chunks = semantic_search(query, text_chunks, embeddings, k=2)

    # Print the results
    print("\nQuery:", query)
    print("Top 2 most relevant text chunks:")
    for i, chunk in enumerate(top_chunks):
        print(f"Context {i + 1}:\n{chunk}\n{'='*40}")

Creating embeddings...

Searching for chunks relevant to: 'What is the ETHOS typology?'


IndexError: index 667 is out of bounds for axis 0 with size 4
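
A likely cause of this error, inferred from the code rather than confirmed by the author: `create_embeddings(query)` is called with a single string. Assuming `embed_content` returns one flat vector in that case, the list comprehension splits it into 768 scalars, so `[0]` is a single number instead of an embedding. The sketch below reproduces the effect with NumPy only; the random vectors are stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
chunk_embeddings = [rng.normal(size=768).astype(np.float32) for _ in range(4)]
chunk_labels = np.array(["chunk 0", "chunk 1", "chunk 2", "chunk 3"])
flat_vector = rng.normal(size=768).tolist()  # stand-in for response['embedding'] of a single string


def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Buggy path: iterating over a flat vector yields 768 zero-dimensional arrays
bad_query = [np.array(x, dtype=np.float32) for x in flat_vector][0]
bad_similarities = [cos(bad_query, emb) for emb in chunk_embeddings]  # each entry has 768 values
bad_indices = np.argsort(bad_similarities)[-2:][::-1]
print(bad_query.shape, bad_indices.shape)

error_message = None
try:
    chunk_labels[bad_indices]  # indices range up to 767, but there are only 4 chunks
except IndexError as exc:
    error_message = str(exc)
print("IndexError:", error_message)

# Working path: a one-element batch keeps the query vector whole
nested = [flat_vector]  # stand-in for response['embedding'] of a one-element list
good_query = [np.array(x, dtype=np.float32) for x in nested][0]
good_similarities = [cos(good_query, emb) for emb in chunk_embeddings]
good_indices = np.argsort(good_similarities)[-2:][::-1]
print(good_query.shape, list(chunk_labels[good_indices]))
```

Under that assumption, calling `create_embeddings([query])[0]` instead of `create_embeddings(query)[0]` would give a proper 768-dimensional query vector.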

## Running a Query on Extracted Chunks

In [25]:
# Load the validation data from a JSON file
with open('/Users/kekunkoya/Desktop/ISEM 770 Class Project/valh.json') as f:
    data = json.load(f)

# Extract the first query from the validation data
query = data[0]['question']

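# Perform semantic search to find the top 2 most relevant text chunks for the query.
# `embeddings` must hold one vector per entry of `text_chunks`, e.g. embeddings = create_embeddings(text_chunks)
top_chunks = semantic_search(query, text_chunks, embeddings, k=2)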

# Print the query
print("Query:", query)

# Print the top 2 most relevant text chunks
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Query: What is the ETHOS typology?
Context 1:
he forth and fifth reviews of statistics 
(Edgar and Meert, 2005, 2006) focused on developing and refining the ETHOS definition and 
considering the measurement issues involved in greater detail. 

24
Homelessness Research in Europe
Table 1.2 ETHOS – European typology on homelessness and housing exclusion
Conceptual 
category
Operational category
Living situation
ROOFLESS
1
People living rough
1.1
Public space or external space
2
People staying in a night shelter 2.1
Night shelter
HOUSELESS
3
People in accommodation  
for the homeless
3.1
3.2
3.3
Homeless hostel
Temporary accommodation
Transitional supported 
accommodation
4
People in a women’s shelter
4.1
Women’s shelter accommodation
5
People in accommodation  
for immigrants
5.1
5.2
Temporary accommodation, 
reception centres 
Migrant workers’ accommodation
6
People due to be released  
from institutions
6.1
6.2
6.3
Penal institutions
Medical institutions
Children’s institutions/homes
7


In [6]:
# --- Corrected semantic_search function with error checking ---
def semantic_search(query: str, text_chunks: list[str], embeddings: list[np.ndarray], k: int = 5) -> list[str]:
    """
    Performs semantic search to find the most relevant text chunks.
    """
    # CRITICAL CHECK: The number of text chunks and embeddings must be equal.
    if len(text_chunks) != len(embeddings):
        raise ValueError(
            f"The number of text chunks ({len(text_chunks)}) must be equal to "
            f"the number of embeddings ({len(embeddings)})."
        )

    # Embed the query as a one-element list so that [0] is a full embedding vector,
    # not a single float taken from a flat vector
    query_embeddings = create_embeddings([query])
    if not query_embeddings:
        return []
    query_embedding = query_embeddings[0]

    # One scalar similarity per chunk
    similarities = [cosine_similarity(query_embedding, emb) for emb in embeddings]

    # Indices of the k highest similarities, best first
    top_indices = np.argsort(similarities)[-k:][::-1]

    return [text_chunks[i] for i in top_indices]
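
For larger collections, the per-chunk loop can be replaced by a single matrix product. The following is a sketch of that idea on toy 2-D vectors, not the notebook's implementation; `top_k_chunks` and the sample data are hypothetical.

```python
import numpy as np
from typing import List


def top_k_chunks(query_vec: np.ndarray, chunk_matrix: np.ndarray, chunks: List[str], k: int = 5) -> List[str]:
    """Rank chunks by cosine similarity to query_vec using one matrix-vector product."""
    # Normalise the rows and the query so that a dot product equals cosine similarity
    row_norms = np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
    unit_rows = chunk_matrix / np.where(row_norms == 0, 1.0, row_norms)
    query_norm = np.linalg.norm(query_vec)
    unit_query = query_vec / (query_norm if query_norm != 0 else 1.0)
    scores = unit_rows @ unit_query       # shape: (n_chunks,)
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return [chunks[i] for i in order]


toy_chunks = ["housing", "employment", "health"]
toy_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # one row per chunk
toy_query = np.array([1.0, 0.2])
ranked = top_k_chunks(toy_query, toy_matrix, toy_chunks, k=2)
print(ranked)
```

With real embeddings, `chunk_matrix` could be built once with `np.vstack(embeddings)` and reused for every query.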

## Generating a Response Based on Retrieved Chunks

In [26]:
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(system_prompt, user_message, model="gpt-4o-mini"):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
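    model (str): The model to be used for generating the response. Default is "gpt-4o-mini".

    Returns:
    The chat completion object returned by the client; the generated text is in response.choices[0].message.content.
    """
    # NOTE: `client` is assumed to be an OpenAI-compatible client created elsewhere; it is not defined in this cell.
    response = client.chat.completions.create(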
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)

In [8]:
import os
import google.generativeai as genai
from typing import List, Dict

# --- 1. Gemini API Configuration ---
# Your GOOGLE_API_KEY should be set in your environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the response generator for Gemini ---
def generate_response(system_prompt: str, user_message: str, model: str = "gemini-1.5-flash") -> str:
    """
    Generates a response from the Gemini model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI's behavior.
    user_message (str): The user's message or query.
    model (str): The model to be used for generating the response. Default is "gemini-1.5-flash".

    Returns:
    str: The response from the AI model as a string.
    """
    try:
        # Pass the system prompt to the GenerativeModel's system_instruction parameter
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        response = gemini_model.generate_content(user_message)
        return response.text
    except Exception as e:
        print(f"An error occurred during response generation: {e}")
        return "I could not generate a response due to an error."

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a query and top_chunks from previous steps
    query = "What are the social factors that contribute to homelessness?"
    top_chunks = [
        "Homelessness is a complex social problem with various contributing factors, including economic, social, and personal issues.",
        "Social factors like family breakdown, domestic violence, and a lack of social support networks can also lead to homelessness."
    ]

    # Define the system prompt for the AI assistant
    system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

    # Create the user prompt based on the top chunks
    user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
    user_prompt = f"{user_prompt}\nQuestion: {query}"

    # Generate AI response
    print("Generating AI response with Gemini...")
    ai_response = generate_response(system_prompt, user_prompt)

    # Print the final AI response
    print("\nAI Response:")
    print(ai_response)

Generating AI response with Gemini...

AI Response:
Based on the provided text, social factors contributing to homelessness include family breakdown, domestic violence, and a lack of social support networks.



## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.

In [27]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)

Score: 1
