## Context-Enriched Retrieval in RAG
Retrieval-Augmented Generation (RAG) enhances AI responses by retrieving relevant knowledge from external sources. Traditional retrieval methods return isolated text chunks, which can lead to incomplete answers.

To address this, we introduce Context-Enriched Retrieval, which ensures that retrieved information includes neighboring chunks for better coherence.

Steps in This Notebook:
- Data Ingestion: Extract text from a PDF.
- Chunking with Overlapping Context: Split text into overlapping chunks to preserve context.
- Embedding Creation: Convert text chunks into numerical representations.
- Context-Aware Retrieval: Retrieve relevant chunks along with their neighbors for better completeness.
- Response Generation: Use a language model to generate responses based on retrieved context.
- Evaluation: Assess the model's response accuracy.

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
import fitz
import os
import numpy as np
import json
import google.generativeai as genai  # Gemini client used for embeddings and generation

## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [2]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Iterate through each page in the PDF
    for page_num in range(mypdf.page_count):
        page = mypdf[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text

## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [3]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
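    # Guard against a zero or negative step, which would make range() fail
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("n must be positive and overlap must satisfy 0 <= overlap < n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):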
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
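
To make the overlap arithmetic concrete, here is a small worked example on a toy string. It is only a sketch: `sliding_chunks` is a hypothetical stand-in that repeats the same slicing logic so the example runs on its own, and the input string and sizes are illustrative. With `n=4` and `overlap=2` the window advances 2 characters at a time.

```python
def sliding_chunks(text, n, overlap):
    """Same sliding-window slicing as chunk_text, repeated so this example is self-contained."""
    step = n - overlap  # how far the window advances each iteration
    return [text[i:i + n] for i in range(0, len(text), step)]

toy_chunks = sliding_chunks("abcdefghij", n=4, overlap=2)
for chunk in toy_chunks:
    print(repr(chunk))
```

Note that the final window can be shorter than `n` and may be fully contained in the previous chunk. With the settings used below (`n=1000`, `overlap=200`), the step is 800 characters, so a text of length `L` yields `ceil(L / 800)` chunks.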

## Setting Up the Google Generative AI Client
We initialize the Google Generative AI (Gemini) client to generate embeddings and responses.

In [4]:
import os
import google.generativeai as genai

# --- Configure Google Generative AI client ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    print("Error: GOOGLE_API_KEY environment variable is not set.")
    exit(1)

try:
    genai.configure(api_key=GOOGLE_API_KEY)
except Exception as e:
    print(f"Error configuring Google Generative AI: {e}")
    exit(1)

# --- Initialize models ---
# For chat completions
chat_model = genai.GenerativeModel("gemini-2.0-flash")

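# For embeddings: embedding models are called through genai.embed_content()
# rather than GenerativeModel, so we only keep the model name here
embedding_model_name = "models/text-embedding-004"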

## Extracting and Chunking Text from a PDF File
Now, we load the PDF, extract text, and split it into chunks.

In [5]:
# Define the path to the PDF file
pdf_path = "/Users/kekunkoya/Desktop/RAG Google/PEMA.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 69

First text chunk:
PENNSYLVANIA
EMERGENCY
PREPAREDNESS
GUIDE
Be Informed. Be Prepared. Be Involved. 
www.Ready.PA.gov 
readypa@pa.gov
Emergency Preparedness Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Table of Contents
TABLE OF CONTENTS  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pages 2-3
INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  Page    4
TOP 10 EMERGENCIES . . . . . . . . . . . . . . . . . . . . . . Pages 4-7         
       
       
     
Floods • Fires • Winter Storms • Tropical Storms, Tornadoes 
and Thunderstorms • Influenza (Flu) Pandemic • Hazardous 
Material Incidents • Earthquakes and Landslides • Nuclear 
Threat • Dam Failures • Terrorism. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
BE PREPARED – MAKE A PLAN   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 
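
The first chunk above is dominated by table-of-contents dot leaders, which add little for retrieval. One possible cleanup step, sketched here under the assumption that runs of three or more dots are layout filler, is to collapse them before chunking. The helper name `strip_dot_leaders` is hypothetical and not part of the original notebook.

```python
import re

# Three or more dots, optionally separated by spaces or tabs (e.g. ". . . . ." or ". .. .")
_DOT_LEADER = re.compile(r"(?:[ \t]*\.){3,}[ \t]*")

def strip_dot_leaders(text):
    """Replace dot-leader runs with a single space and tidy leftover spacing."""
    cleaned = _DOT_LEADER.sub(" ", text)
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)  # collapse repeated spaces
    return "\n".join(line.strip() for line in cleaned.splitlines())

sample = "INTRODUCTION . . . . . . . . . .  Page    4\nwww.Ready.PA.gov"
print(strip_dot_leaders(sample))
```

Dots inside URLs and ordinary sentence punctuation are left alone because they never occur three in a row. A genuine ellipsis (`...`) would also be collapsed, so this is only suitable for documents where that is acceptable.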

## Creating Embeddings for Text Chunks
Embeddings transform text into numerical vectors, which allow for efficient similarity search.

In [6]:
def create_embeddings(text, model_name="models/text-embedding-004"):
    """
    Creates embeddings for the given text using the specified Google AI model.

    Args:
    text (str or list of str): The input text(s) for which embeddings are to be created.
    model_name (str): The model to be used for creating embeddings. Default is "models/text-embedding-004".

    Returns:
    list: A single embedding vector (list of floats) for a string input, or a list of
    embedding vectors for a list input. Returns an empty list on error.
    """
    try:
        # embed_content returns a dict whose 'embedding' entry is one vector for a
        # string input and a list of vectors for a list input
        response = genai.embed_content(model=model_name, content=text, task_type="RETRIEVAL_DOCUMENT")
        return response['embedding']
    except Exception as e:
        print(f"Error creating embeddings: {e}")
        return []  # Return empty list on error

# Create embeddings for the text chunks produced in the previous step
chunk_embeddings = create_embeddings(text_chunks)
print(f"Generated {len(chunk_embeddings)} embeddings.")


## Implementing Context-Aware Semantic Search
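We modify retrieval to include neighboring chunks for better context: after finding the best-matching chunk, we also return the `context_size` chunks on either side of it.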

In [7]:
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
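
As a quick sanity check of the formula, the sketch below evaluates cosine similarity on a few toy vectors. It also adds a guard for zero-length vectors: the function above divides by the product of the norms, so an all-zero vector (such as the placeholder embeddings used later in this notebook) would produce `nan`. The name `safe_cosine_similarity` and the choice to return `0.0` in that case are assumptions made for illustration.

```python
import numpy as np

def safe_cosine_similarity(vec1, vec2):
    """Cosine similarity that returns 0.0 when either vector has zero length."""
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)
    denom = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if denom == 0.0:
        return 0.0  # similarity is undefined here; 0.0 is a convention, not a fact
    return float(np.dot(vec1, vec2) / denom)

print(safe_cosine_similarity([1, 2], [2, 4]))  # parallel vectors
print(safe_cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors
print(safe_cosine_similarity([0, 0], [1, 1]))  # zero vector, handled by the guard
```

Parallel vectors score close to 1 regardless of length, and orthogonal vectors score 0, which is why cosine similarity compares direction rather than magnitude.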

In [8]:
def context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1):
    """
    Retrieves the most relevant chunk along with its neighboring chunks.

    Args:
    query (str): Search query.
    text_chunks (List[str]): List of text chunks.
    embeddings (List[List[float]]): One embedding vector per chunk, in the same order as text_chunks.
    k (int): Number of relevant chunks to retrieve. Currently unused: only the single best match is expanded.
    context_size (int): Number of neighboring chunks to include on each side.

    Returns:
    List[str]: Relevant text chunks with contextual information.
    """
    # Convert the query into an embedding vector
    # (create_embeddings returns the vector itself for a single string)
    query_embedding = np.array(create_embeddings(query))
    similarity_scores = []

    # Compute similarity scores between query and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        # Calculate cosine similarity between the query embedding and current chunk embedding
        similarity_score = cosine_similarity(query_embedding, np.array(chunk_embedding))
        # Store the index and similarity score as a tuple
        similarity_scores.append((i, similarity_score))

    # Nothing to rank if no chunk embeddings were supplied
    if not similarity_scores:
        return []

    # Sort chunks by similarity score in descending order (highest similarity first)
    similarity_scores.sort(key=lambda x: x[1], reverse=True)

    # Get the index of the most relevant chunk
    top_index = similarity_scores[0][0]

    # Define the range for context inclusion
    # Ensure we don't go below 0 or beyond the length of text_chunks
    start = max(0, top_index - context_size)
    end = min(len(text_chunks), top_index + context_size + 1)

    # Return the relevant chunk along with its neighboring context chunks
    return [text_chunks[i] for i in range(start, end)]
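
The function above expands only the single best match. One way the `k` parameter could be honored is sketched below: take the `k` highest-scoring chunks, expand each into a neighbor window, and merge overlapping windows so no chunk is returned twice. This is an illustrative variant that works on precomputed vectors, not the notebook's method; `top_k_with_neighbors` and the toy vectors are hypothetical.

```python
import numpy as np

def top_k_with_neighbors(query_vec, chunk_vecs, k=2, context_size=1):
    """Return sorted chunk indices covering the k best matches plus their neighbors."""
    chunk_matrix = np.asarray(chunk_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)

    # Cosine similarity of the query against every chunk at once;
    # zero norms are replaced by 1.0 so those chunks simply score 0
    norms = np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query)
    scores = chunk_matrix @ query / np.where(norms == 0, 1.0, norms)

    # Indices of the k highest scores, best first
    top = np.argsort(scores)[::-1][:k]

    # Expand each hit into a window and merge windows through a set
    selected = set()
    for idx in top:
        start = max(0, int(idx) - context_size)
        end = min(len(chunk_matrix), int(idx) + context_size + 1)
        selected.update(range(start, end))
    return sorted(selected)

toy_vecs = [[1, 0], [0.9, 0.1], [0, 1], [0, 1], [0.1, 0.9], [-1, 0]]
indices = top_k_with_neighbors([1, 0], toy_vecs, k=2, context_size=1)
print(indices)
```

Returning indices in document order, rather than in score order, keeps the merged passages readable as continuous text when they are joined into a prompt.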

## Running a Query with Context Retrieval
We now test the context-enriched retrieval.

In [9]:
import os
import json

# Load the validation dataset from a JSON file
json_file_path = '/Users/kekunkoya/Desktop/RAG Google/PA211_dataset.json'
try:
    with open(json_file_path, 'r') as f:
        data = json.load(f)
except FileNotFoundError:
    print(f"Error: The file '{json_file_path}' was not found.")
    print("Please ensure the path is correct and the file exists.")
    # Create a dummy dataset file for demonstration if it doesn't exist
    data = [{"question": "What are resources available for food?"}]
    os.makedirs(os.path.dirname(json_file_path), exist_ok=True)
    with open(json_file_path, 'w') as f:
        json.dump(data, f)
    print(f"A dummy dataset file has been created at '{json_file_path}' for demonstration purposes.")
except json.JSONDecodeError:
    print(f"Error: Could not decode JSON from '{json_file_path}'. Please check file content.")
    data = [{"question": "What is the primary function of embeddings?"}]

# Extract the first question from the dataset to use as our query
if data:
    query = data[0]['question']
else:
    query = "What is the purpose of this code?"
    print("Warning: No questions found in the dataset. Using a default query.")

# Placeholder variables for demonstration
text_chunks = ["Chunk A", "Chunk B", "Chunk C"]
response = type("Response", (), {"data": []})()  # Dummy response object

# Retrieve the most relevant chunk and its neighboring chunks for context
# top_chunks = context_enriched_search(query, text_chunks, response.data, k=1, context_size=1)
# For demo purposes, we'll pretend we got one chunk:
top_chunks = ["Example context chunk for query."]

# Print the query for reference
print("\nQuery:", query)

# Print each retrieved chunk with a heading and separator
for i, chunk in enumerate(top_chunks):
    print(f"\n--- Context {i + 1} ---\n{chunk}\n")



Query: Where can I find emergency food in ZIP code 17104?

--- Context 1 ---
Example context chunk for query.



## Generating a Response Using Retrieved Context
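We now re-run retrieval using placeholder embedding and search functions. The retrieved chunks are the context that a language model would receive to generate a response.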

In [10]:
import os
import json

# Placeholder for create_embeddings (this shadows the real function defined earlier)
def create_embeddings(text_list):
    """
    Dummy implementation: in real usage, this should call your embedding API
    and return a list of objects with an .embedding attribute.
    """
    class Embedding:
        def __init__(self, vector):
            self.embedding = vector
    # Return dummy embeddings (e.g., zero-vectors)
    return [Embedding([0.0] * 768) for _ in text_list]

# Placeholder for context_enriched_search function
def context_enriched_search(query, text_chunks, chunk_embeddings, k=1, context_size=1):
    """
    Dummy search: returns the first chunk for demonstration.
    """
    return [text_chunks[0]] if text_chunks else []

# Load the validation dataset from a JSON file
json_file_path = '/Users/kekunkoya/Desktop/RAG Google/PA211_dataset.json'
try:
    with open(json_file_path, 'r') as f:
        data = json.load(f)
except FileNotFoundError:
    print(f"Error: The file '{json_file_path}' was not found.")
    print("Please ensure the path is correct and the file exists.")
    # Create a dummy dataset file for demonstration if it doesn't exist
    data = [{"question": "What is AI Explainable AI"}]
    os.makedirs(os.path.dirname(json_file_path), exist_ok=True)
    with open(json_file_path, 'w') as f:
        json.dump(data, f)
    print(f"A dummy dataset file has been created at '{json_file_path}' for demonstration purposes.")
except json.JSONDecodeError:
    print(f"Error: Could not decode JSON from '{json_file_path}'. Please check file content.")
    data = [{"question": "What is the primary function of embeddings?"}]

# Extract the first question from the dataset to use as our query
if data:
    query = data[0]['question']
else:
    query = "Where can I find emergency food in ZIP code 17104?"
    print("Warning: No questions found in the dataset. Using a default query.")

# Prepare text chunks and their embeddings (placeholder)
text_chunks = ["Chunk A", "Chunk B", "Chunk C"]
chunk_embeddings = create_embeddings(text_chunks)

# Create embedding for the query
query_embedding_obj = create_embeddings([query])[0]  # get first embedding object
query_embedding = query_embedding_obj.embedding

# Retrieve the most relevant chunk and its neighboring chunks for context
top_chunks = context_enriched_search(query, text_chunks, chunk_embeddings, k=1, context_size=1)

# Print the query for reference
print("\nQuery:", query)

# Print each retrieved chunk with a heading and separator
for i, chunk in enumerate(top_chunks):
    print(f'\n--- Context {i + 1} ---\n{chunk}\n=====================================')



Query: Where can I find emergency food in ZIP code 17104?

--- Context 1 ---
Chunk A
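
The retrieved chunks still have to be turned into a prompt for the language model. The sketch below shows one way to do that. The prompt wording and the names `build_rag_prompt`, `generate_answer`, and `echo_generator` are assumptions for illustration, and the model call is passed in as a plain callable so the example runs without an API key.

```python
def build_rag_prompt(query, context_chunks):
    """Join retrieved chunks into a numbered context block followed by the question."""
    context = "\n\n".join(
        f"[Context {i + 1}]\n{chunk}" for i, chunk in enumerate(context_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )

def generate_answer(query, context_chunks, generate_fn):
    """generate_fn is any callable that maps a prompt string to a response string."""
    return generate_fn(build_rag_prompt(query, context_chunks))

# Stand-in generator (hypothetical) so the sketch runs offline. With the client
# configured earlier, one could instead pass:
#   lambda p: chat_model.generate_content(p).text
def echo_generator(prompt):
    return f"(stub) prompt received: {len(prompt)} characters"

prompt = build_rag_prompt("Where can I find emergency food?", ["Chunk A", "Chunk B"])
print(prompt)
print(generate_answer("Where can I find emergency food?", ["Chunk A", "Chunk B"], echo_generator))
```

Keeping prompt construction separate from the model call makes it easy to inspect exactly what context the model sees for a given query.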


## Evaluating the AI Response
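We compare the AI response with the expected answer and assign two scores. The first is embedding-based: cosine similarity above 0.90 scores 1, above 0.70 scores 0.5, and anything else scores 0. The second is a verdict (1, 0.5, or 0) from an LLM acting as a judge.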

In [1]:
import json
import numpy as np
import os
import google.generativeai as genai
from dotenv import load_dotenv

# --- Load the dataset ---
with open("PA211_dataset.json", "r") as f:
    data = json.load(f)

# --- Initialize Gemini client ---
# Load environment variables from a .env file
load_dotenv()

# Get the API key from the environment variable
api_key = os.getenv("GEMINI_API_KEY")

# Check if the API key is set
if not api_key:
    print("Error: The GEMINI_API_KEY environment variable is not set.")
    exit()

# Configure the Gemini API client
try:
    genai.configure(api_key=api_key)
except Exception as e:
    print(f"An error occurred during Gemini API configuration: {e}")
    exit()

# --- Cosine similarity function ---
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# --- Embedding-based evaluation ---
def evaluate_response(user_query, ai_answer, true_answer):
    # Embed both answers with the same model so the vectors are comparable.
    # This cell uses 'models/embedding-001'; 'text-embedding-004', the model used
    # for the chunk embeddings earlier, is also a valid embedding model.
    ai_emb_response = genai.embed_content(
        model="models/embedding-001",
        content=ai_answer
    )
    ai_emb = ai_emb_response['embedding']
    
    true_emb_response = genai.embed_content(
        model="models/embedding-001",
        content=true_answer
    )
    true_emb = true_emb_response['embedding']

    similarity = cosine_similarity(ai_emb, true_emb)

    if similarity > 0.90:
        score = 1
    elif similarity > 0.70:
        score = 0.5
    else:
        score = 0

    return score, similarity

# --- LLM verification ---
def llm_verify(user_query, ai_answer, true_answer):
    evaluation_prompt = f"""
You are an evaluation system. Compare the AI's answer with the true answer.

User Query: {user_query}

AI Answer:
{ai_answer}

True Answer:
{true_answer}

Output ONLY a number:
1 - Very close match
0.5 - Partial match
0 - Incorrect
"""
    # Use a Gemini chat model as the judge
    model = genai.GenerativeModel("gemini-2.0-flash")
    
    llm_response = model.generate_content(
        evaluation_prompt,
        generation_config=genai.GenerationConfig(
            temperature=0
        )
    )
    
    # Access the text content from the response
    return llm_response.text.strip()

# --- Main evaluation loop ---
results = []

for i, item in enumerate(data):
    user_query = item['question']
    true_answer = item['ideal_answer']

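    # Step 1: Generate AI answer using a Gemini model.
    # Note: the query is sent without any retrieved context, so this loop scores
    # the bare model rather than the retrieval pipeline.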
    model = genai.GenerativeModel("gemini-2.0-flash")
    ai_gen = model.generate_content(
        user_query,
        generation_config=genai.GenerationConfig(
            temperature=0
        )
    )
    ai_answer = ai_gen.text
    
    # Step 2: Embedding-based scoring
    score, similarity = evaluate_response(user_query, ai_answer, true_answer)

    # Step 3: Optional LLM verification
    llm_score = llm_verify(user_query, ai_answer, true_answer)

    # Step 4: Store results
    results.append({
        "query": user_query,
        "ai_answer": ai_answer,
        "true_answer": true_answer,
        "embedding_score": score,
        "similarity": round(similarity, 3),
        "llm_score": llm_score
    })

# --- Save results to file ---
with open("evaluation_results.json", "w") as f:
    json.dump(results, f, indent=2)

print("✅ Evaluation complete! Results saved to evaluation_results.json")

Error: The GEMINI_API_KEY environment variable is not set.
✅ Evaluation complete! Results saved to evaluation_results.json
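
`llm_verify` returns the judge's reply as raw text, so `llm_score` is stored as a string and may include extra words despite the "Output ONLY a number" instruction. A small parser, sketched below, can convert it to one of the three allowed values. The helper name `parse_llm_score` and the decision to return `None` for anything unexpected are assumptions.

```python
import re

_ALLOWED_SCORES = {0.0, 0.5, 1.0}

def parse_llm_score(raw):
    """Extract the first number from the judge's reply; None if it is not 0, 0.5, or 1."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return None
    value = float(match.group())
    return value if value in _ALLOWED_SCORES else None

for reply in ["0.5", "1 - Very close match", "Score: 0", "n/a", "0.7"]:
    print(repr(reply), "->", parse_llm_score(reply))
```

Storing the parsed float instead of the raw string would make the LLM score directly comparable to the embedding score in later cells.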



In [1]:
import json

# --- Load evaluation results ---
with open("evaluation_results.json", "r") as f:
    results = json.load(f)

# --- Print results in a readable table ---
print(f"{'Query':<50} | {'Embed Score':<12} | {'Similarity':<10} | {'LLM Score':<9}")
print("-" * 90)

for r in results:
    query_preview = (r['query'][:47] + "...") if len(r['query']) > 50 else r['query']
    print(f"{query_preview:<50} | {r['embedding_score']:<12} | {r['similarity']:<10} | {r['llm_score']:<9}")

# --- Optional: Show only incorrect answers ---
print("\nIncorrect answers (Embed Score = 0):")
for r in results:
    if r['embedding_score'] == 0:
        print(f"- {r['query']}")


Query                                              | Embed Score  | Similarity | LLM Score
------------------------------------------------------------------------------------------
Where can I find emergency food in ZIP code 17104? | 1            | 0.916      | 0.5      
I lost power in 17104. Where can I go to get ic... | 0.5          | 0.876      | 0.5      
Is there an emergency shelter near 17104 right ... | 0.5          | 0.855      | 0.5      
Where can seniors in 17104 get help with medica... | 0.5          | 0.872      | 1        
How can I reach someone for water delivery in Z... | 0.5          | 0.788      | 0.5      
What food programs are available for children a... | 0.5          | 0.834      | 0.5      
Where can I get culturally appropriate food aft... | 0.5          | 0.754      | 0.5      
Is it safe to drink the tap water in 17104 afte... | 0.5          | 0.803      | 1        
What if I need water for medical equipment but ... | 0.5          | 0.71       | 0.5      
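
To summarize the table, the sketch below computes the mean similarity and counts how often the two scoring methods agree. The values are copied by hand from the nine rows shown above; in practice they would be read from `evaluation_results.json`.

```python
rows = [
    # (embedding_score, similarity, llm_score), copied from the table above
    (1.0, 0.916, 0.5),
    (0.5, 0.876, 0.5),
    (0.5, 0.855, 0.5),
    (0.5, 0.872, 1.0),
    (0.5, 0.788, 0.5),
    (0.5, 0.834, 0.5),
    (0.5, 0.754, 0.5),
    (0.5, 0.803, 1.0),
    (0.5, 0.710, 0.5),
]

mean_similarity = sum(sim for _, sim, _ in rows) / len(rows)
agreements = sum(1 for embed, _, llm in rows if embed == llm)

print(f"Mean similarity: {mean_similarity:.3f}")
print(f"Embedding and LLM scores agree on {agreements} of {len(rows)} queries")
```

Every similarity falls between 0.71 and 0.916, so the 0.70 and 0.90 thresholds place eight of the nine answers in the 0.5 bucket. The embedding score here mostly reflects the threshold choice, which is one reason to read it alongside the LLM verdict.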

In [3]:
import json

# --- Load evaluation results ---
with open("evaluation_results.json", "r") as f:
    results = json.load(f)

# --- Print results in a readable table ---
print(f"{'Query':<50} | {'AI Response':<60} | {'Embed Score':<12} | {'Similarity':<10} | {'LLM Score':<9}")
print("-" * 150)

for r in results:
    # Truncate long query and AI response for display
    query_preview = (r['query'][:47] + "...") if len(r['query']) > 50 else r['query']
    ai_preview = (r['ai_answer'][:57] + "...") if len(r['ai_answer']) > 60 else r['ai_answer']

    print(f"{query_preview:<50} | {ai_preview:<60} | {r['embedding_score']:<12} | {r['similarity']:<10} | {r['llm_score']:<9}")

# --- Optional: Show only incorrect answers with details ---
print("\nIncorrect answers (Embed Score = 0):")
for r in results:
    if r['embedding_score'] == 0:
        print(f"\nQ: {r['query']}")
        print(f"AI: {r['ai_answer']}")
        print(f"TRUE: {r['true_answer']}")


Query                                              | AI Response                                                  | Embed Score  | Similarity | LLM Score
------------------------------------------------------------------------------------------------------------------------------------------------------
Where can I find emergency food in ZIP code 17104? | Okay, I can help you find emergency food resources in the... | 1            | 0.916      | 0.5      
I lost power in 17104. Where can I go to get ic... | Okay, I can help you find places to get ice and charge yo... | 0.5          | 0.876      | 0.5      
Is there an emergency shelter near 17104 right ... | I am an AI and do not have access to real-time informatio... | 0.5          | 0.855      | 0.5      
Where can seniors in 17104 get help with medica... | Okay, let's break down how seniors in the 17104 zip code ... | 0.5          | 0.872      | 1        
How can I reach someone for water delivery in Z... | Okay, here are a few ways 

In [4]:
import json

# --- Load evaluation results ---
with open("evaluation_results.json", "r") as f:
    results = json.load(f)

# --- Add a new 'error' field to each result ---
for r in results:
    # We'll define an error as any response that isn't a perfect match (score < 1)
    r['error'] = 'Yes' if r['embedding_score'] < 1 else 'No'

# --- Print results in a readable table with the new 'Error' column ---
print(f"{'Query':<50} | {'AI Response':<60} | {'Embed Score':<12} | {'Similarity':<10} | {'LLM Score':<9} | {'Error':<5}")
print("-" * 170)

for r in results:
    # Truncate long query and AI response for display
    query_preview = (r['query'][:47] + "...") if len(r['query']) > 50 else r['query']
    ai_preview = (r['ai_answer'][:57] + "...") if len(r['ai_answer']) > 60 else r['ai_answer']

    print(f"{query_preview:<50} | {ai_preview:<60} | {r['embedding_score']:<12} | {r['similarity']:<10} | {r['llm_score']:<9} | {r['error']:<5}")

# --- Optional: Show only incorrect answers with details ---
print("\nIncorrect answers (Embed Score = 0):")
for r in results:
    if r['embedding_score'] == 0:
        print(f"\nQ: {r['query']}")
        print(f"AI: {r['ai_answer']}")
        print(f"TRUE: {r['true_answer']}")

Query                                              | AI Response                                                  | Embed Score  | Similarity | LLM Score | Error
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Where can I find emergency food in ZIP code 17104? | Okay, I can help you find emergency food resources in the... | 1            | 0.916      | 0.5       | No   
I lost power in 17104. Where can I go to get ic... | Okay, I can help you find places to get ice and charge yo... | 0.5          | 0.876      | 0.5       | Yes  
Is there an emergency shelter near 17104 right ... | I am an AI and do not have access to real-time informatio... | 0.5          | 0.855      | 0.5       | Yes  
Where can seniors in 17104 get help with medica... | Okay, let's break down how seniors in the 17104 zip code ... | 0.5          | 0.872      | 1         | Yes  
How can I reach som