## Context-Enriched Retrieval in RAG
Retrieval-Augmented Generation (RAG) enhances AI responses by retrieving relevant knowledge from external sources. Traditional retrieval methods return isolated text chunks, which can lead to incomplete answers.

To address this, we introduce Context-Enriched Retrieval, which ensures that retrieved information includes neighboring chunks for better coherence.

Steps in This Notebook:
- Data Ingestion: Extract text from a PDF.
- Chunking with Overlapping Context: Split text into overlapping chunks to preserve context.
- Embedding Creation: Convert text chunks into numerical representations.
- Context-Aware Retrieval: Retrieve relevant chunks along with their neighbors for better completeness.
- Response Generation: Use a language model to generate responses based on retrieved context.
- Evaluation: Assess the model's response accuracy.

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [2]:
import fitz  # pip install PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the entire PDF.
    """
    # Open the PDF file
    doc = fitz.open(pdf_path)
    all_text = []

    # Iterate through each page in the PDF
    for page in doc:
        all_text.append(page.get_text("text"))

    doc.close()
    return "\n".join(all_text)

# Example usage:
pdf_file = '/Users/kekunkoya/Desktop/RAG Project/PEMA.pdf'
text = extract_text_from_pdf(pdf_file)
print(text[:500])  # print the first 500 characters to verify


PENNSYLVANIA
EMERGENCY
PREPAREDNESS
GUIDE
Be Informed. Be Prepared. Be Involved. 
www.Ready.PA.gov 
readypa@pa.gov

Emergency Preparedness Guide. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Table of Contents
TABLE OF CONTENTS  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pages 2-3
INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  Page


The cell above uses fitz (the PyMuPDF module) to extract the text from the PDF and prints the first 500 characters to verify the extraction.

## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [3]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
        text (str): The text to be chunked.
        n (int): The number of characters in each chunk.
        overlap (int): The number of characters shared by consecutive chunks.
            Must be smaller than n.

    Returns:
        List[str]: A list of text chunks.
    """
    # A non-positive step would either raise inside range() or silently yield no chunks
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("Require n > 0 and 0 <= overlap < n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
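
To make the overlap concrete, here is a minimal sketch on a 26-character string. The helper name `chunk_with_overlap` and the values `n=10`, `overlap=3` are chosen only for illustration; the logic mirrors the step size of `n - overlap` used above.

```python
def chunk_with_overlap(text: str, n: int, overlap: int) -> list[str]:
    """Split text into n-character windows that share `overlap` characters."""
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")
    step = n - overlap  # each new window starts `step` characters after the previous one
    return [text[i:i + n] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, n=10, overlap=3)

for idx, chunk in enumerate(chunks):
    print(idx, repr(chunk))
# 0 'abcdefghij'
# 1 'hijklmnopq'
# 2 'opqrstuvwx'
# 3 'vwxyz'
```

Each chunk begins with the last three characters of the previous one, and the final chunk is shorter because the text runs out.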

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [4]:
# Initialize the OpenAI client with the API key
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
)

## Extracting and Chunking Text from a PDF File
Now, we load the PDF, extract text, and split it into chunks.

In [12]:
# Define the path to the PDF file
pdf_path = "/Users/kekunkoya/Desktop/RAG Project/Resources.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 15

First text chunk:
Agency Name *
Site Name *
Service Name *
Site Main Phone 
Number
Service Eligibility
Lancaster County 
Housing and 
Redevelopment 
Authorities
Lancaster County Housing 
and Redevelopment 
Authorities
Home Repair Program
717-394-0793
Based on annual gross income, 
according to family size; Available 
equity in home
Housing and Repairs
Lancaster County 
Housing and 
Redevelopment 
Authorities
Lancaster County Housing 
and Redevelopment 
Authorities
Public Infrastructure and 
Community Facilities Grant 
Administration
717-394-0793
Local municipalities outside the 
city of Lancaster

Lancaster County 
Housing and 
Redevelopment 
Authorities
Lancaster County Housing 
and Redevelopment 
Authorities
Rental Housing Program
717-394-0793
Open to rental housing 
developers only; Properties must 
be located in Lancaster County, 
outside of Lancaster City
Lancaster County 
Housing and 
Redevelopment 
Authorities
Lancaster County Housing 
and Redevelopmen

## Creating Embeddings for Text Chunks
Embeddings transform text into numerical vectors, which allow for efficient similarity search.

In [13]:
def create_embeddings(text, model="text-embedding-3-small"):
    """
    Creates embeddings for the given text using the specified OpenAI model.

    Args:
    text (str or List[str]): The input text, or list of texts, for which embeddings are to be created.
    model (str): The model to be used for creating embeddings. Default is "text-embedding-3-small".

    Returns:
    The response object from the OpenAI API; each vector is available as response.data[i].embedding.
    """
    # Create embeddings for the input text using the specified model
    response = client.embeddings.create(
        model=model,
        input=text
    )

    return response  # Return the response containing the embeddings

# Create embeddings for the text chunks
response = create_embeddings(text_chunks)

## Implementing Context-Aware Semantic Search
We modify retrieval to include neighboring chunks for better context.

In [14]:
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
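
As a quick sanity check on what cosine similarity measures, the following dependency-free sketch evaluates three toy pairs of 2-D vectors. The function name `cosine` and the zero-vector convention are assumptions made for this illustration; the notebook's NumPy version would divide by zero in that case.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors, close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors, 0
opposite = cosine([1.0, 0.0], [-1.0, 0.0])        # opposite vectors, -1

print(same_direction, orthogonal, opposite)
```

Only the direction of the vectors matters, not their length, which is why scaling `[1.0, 2.0]` to `[2.0, 4.0]` leaves the score at (approximately) 1.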

In [15]:
def context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1):
    """
    Retrieves the most relevant chunk along with its neighboring chunks.

    Args:
    query (str): Search query.
    text_chunks (List[str]): List of text chunks.
    embeddings (List): Chunk embedding objects, each exposing an .embedding vector (for example, response.data).
    k (int): Number of relevant chunks to retrieve.
    context_size (int): Number of neighboring chunks to include.

    Returns:
    List[str]: Relevant text chunks with contextual information.
    """
    # Convert the query into an embedding vector
    query_embedding = create_embeddings(query).data[0].embedding
    similarity_scores = []

    # Compute similarity scores between query and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        # Calculate cosine similarity between the query embedding and current chunk embedding
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        # Store the index and similarity score as a tuple
        similarity_scores.append((i, similarity_score))

    # Sort chunks by similarity score in descending order (highest similarity first)
    similarity_scores.sort(key=lambda x: x[1], reverse=True)

    # Get the indices of the k most relevant chunks
    top_indices = [index for index, _ in similarity_scores[:k]]

    # Collect each top chunk together with its neighbors,
    # ensuring we don't go below 0 or beyond the length of text_chunks
    selected = set()
    for top_index in top_indices:
        start = max(0, top_index - context_size)
        end = min(len(text_chunks), top_index + context_size + 1)
        selected.update(range(start, end))

    # Return the selected chunks in document order, without duplicates
    return [text_chunks[i] for i in sorted(selected)]
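
The neighbor-window arithmetic can be checked in isolation, without calling the embedding API. The sketch below works on chunk indices only; the helper name `expand_with_neighbors` is hypothetical, and `num_chunks=15` matches the chunk count printed earlier.

```python
def expand_with_neighbors(top_indices, num_chunks, context_size=1):
    """Return sorted, de-duplicated indices covering each hit and its neighbors."""
    selected = set()
    for idx in top_indices:
        start = max(0, idx - context_size)               # clamp at the first chunk
        end = min(num_chunks, idx + context_size + 1)    # clamp at the last chunk
        selected.update(range(start, end))
    return sorted(selected)


print(expand_with_neighbors([0], num_chunks=15))      # [0, 1]
print(expand_with_neighbors([14], num_chunks=15))     # [13, 14]
print(expand_with_neighbors([5, 6], num_chunks=15))   # [4, 5, 6, 7]
```

A hit at either end of the document yields a shorter window, and adjacent hits share neighbors rather than repeating them. Because the chunks themselves overlap by 200 characters, the returned context will still contain some repeated text at chunk boundaries.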

## Running a Query with Context Retrieval
We now test the context-enriched retrieval.

In [16]:
# Load the validation dataset from a JSON file
with open('/Users/kekunkoya/Desktop/RAG Project/PA211_dataset.json') as f:
    data = json.load(f)

# Extract the first question from the dataset to use as our query
query = data[0]['question']

# Retrieve the most relevant chunk and its neighboring chunks for context
# Parameters:
# - query: The question we're searching for
# - text_chunks: Our text chunks extracted from the PDF
# - response.data: The embeddings of our text chunks
# - k=1: Return the top match
# - context_size=1: Include 1 chunk before and after the top match for context
top_chunks = context_enriched_search(query, text_chunks, response.data, k=1, context_size=1)

# Print the query for reference
print("Query:", query)
# Print each retrieved chunk with a heading and separator
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Query: Where can I find emergency food in ZIP code 17104?
Context 1:
rrowers that are 
considering a PHFA loan product 
and have a FICO credit score 
lower than 660 are required to 
complete a course prior to closing 
on their loan.
Pennsylvania Utility Law 
Project (PULP) Hotline
844-645-2500
Serves individuals and families in 
Pennsylvania who are facing a 
utility shutoff or are already 
without service.
Housing 
Discrimination 
Hotline
Harrisburg
Housing Discrimination 
Hotline
No limitations or restrictions
Regional Housing 
Legal Services
Pennsylvania Utility Law 
Project Office, Harrisburg

Pennsylvania Housing 
Finance Agency
Pennsylvania Housing 
Finance Agency
Homeowner's Emergency 
Mortgage Assistance 
Program (HEMAP)
A) Residents of Pennsylvania who 
are homeowners with mortgage 
delinquencies caused by 
circumstances beyond their 
control B) Must not be in 
foreclosure C) Must have the 
ability to regain financial stability 
within 24 months.
Manheim Central 
Food Pantry
M

## Generating a Response Using Retrieved Context
We now generate a response using an LLM, passing the retrieved chunks as context.

In [17]:
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(system_prompt, user_message, model="gpt-4o-mini"):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt to guide the AI assistant.
    user_message (str): The user's message or query.
    model (str): The model to be used for generating the response. Default is "gpt-4o-mini".

    Returns:
    The response object from the AI model; the answer text is in response.choices[0].message.content.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
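
Because the system prompt tells the model to reply with a fixed sentence when the context is insufficient, it can be useful to detect that reply before scoring. The following is a minimal sketch; the helper name `is_fallback` and the example answers are illustrative assumptions, not part of the notebook.

```python
# The refusal sentence copied from the system prompt above
FALLBACK_SENTENCE = "I do not have enough information to answer that."


def is_fallback(answer: str) -> bool:
    """True when the answer contains the refusal sentence from the system prompt."""
    normalized = " ".join(answer.lower().split())  # collapse whitespace, ignore case
    return FALLBACK_SENTENCE.lower().rstrip(".") in normalized


print(is_fallback("I do not have enough information to answer that."))  # True
print(is_fallback("You can call the Housing Discrimination Hotline."))  # False
```

In the notebook, the string to check would be `ai_response.choices[0].message.content`. Separating fallback replies from wrong answers helps tell retrieval misses apart from generation errors.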

## Evaluating the AI Response
We compare the AI response with the expected answer and assign a score.

In [18]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)

Score: 0
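
The judge replies with free text such as `Score: 0`, so a numeric value has to be extracted before scores can be aggregated. Here is a minimal sketch of one way to do that; the function name `parse_score` and the regular expression are assumptions for illustration and only recognize the three values named in the evaluation prompt.

```python
import re

# Matches a standalone 0, 0.5, or 1 (optionally written as 0.0 or 1.0)
_SCORE_PATTERN = re.compile(r"(?<![\d.])(0\.5|1(?:\.0)?|0(?:\.0)?)(?!\.?\d)")


def parse_score(judge_output: str):
    """Return the first 0, 0.5, or 1 found in the judge's text, or None."""
    match = _SCORE_PATTERN.search(judge_output)
    if match is None:
        return None
    return float(match.group(1))


print(parse_score("Score: 0"))            # 0.0
print(parse_score("Score: 0.5"))          # 0.5
print(parse_score("I assign a score of 1."))  # 1.0
print(parse_score("No score given"))      # None
```

Returning `None` for unparseable output keeps a malformed judge reply from being silently counted as a score of 0.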


In [19]:
import json
import numpy as np
from openai import OpenAI

# --- Load the dataset ---
with open("PA211_dataset.json", "r") as f:
    data = json.load(f)

# --- Initialize OpenAI client ---
client = OpenAI()

# --- Cosine similarity function ---
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# --- Embedding-based evaluation ---
def evaluate_response(user_query, ai_answer, true_answer):
    ai_emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=ai_answer
    ).data[0].embedding
    
    true_emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=true_answer
    ).data[0].embedding

    similarity = cosine_similarity(ai_emb, true_emb)

    if similarity > 0.90:
        score = 1
    elif similarity > 0.70:
        score = 0.5
    else:
        score = 0

    return score, similarity

# --- LLM verification ---
def llm_verify(user_query, ai_answer, true_answer):
    evaluation_prompt = f"""
You are an evaluation system. Compare the AI's answer with the true answer.

User Query: {user_query}

AI Answer:
{ai_answer}

True Answer:
{true_answer}

Output ONLY a number:
1 - Very close match
0.5 - Partial match
0 - Incorrect
"""
    llm_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": evaluation_prompt}],
        temperature=0
    )
    return llm_response.choices[0].message.content.strip()

# --- Main evaluation loop ---
results = []

for i, item in enumerate(data):
    user_query = item['question']
    true_answer = item['ideal_answer']

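    # Step 1: Generate AI answer (replace with your model).
    # Note: this call sends only the question, with no retrieved context,
    # so the scores below measure the bare model rather than the RAG pipeline.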
    ai_gen = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_query}],
        temperature=0
    )
    ai_answer = ai_gen.choices[0].message.content

    # Step 2: Embedding-based scoring
    score, similarity = evaluate_response(user_query, ai_answer, true_answer)

    # Step 3: Optional LLM verification
    llm_score = llm_verify(user_query, ai_answer, true_answer)

    # Step 4: Store results
    results.append({
        "query": user_query,
        "ai_answer": ai_answer,
        "true_answer": true_answer,
        "embedding_score": score,
        "similarity": round(similarity, 3),
        "llm_score": llm_score
    })

# --- Save results to file ---
with open("evaluation_results.json", "w") as f:
    json.dump(results, f, indent=2)

print("✅ Evaluation complete! Results saved to evaluation_results.json")


✅ Evaluation complete! Results saved to evaluation_results.json


In [20]:
import json

# --- Load evaluation results ---
with open("evaluation_results.json", "r") as f:
    results = json.load(f)

# --- Print results in a readable table ---
print(f"{'Query':<50} | {'Embed Score':<12} | {'Similarity':<10} | {'LLM Score':<9}")
print("-" * 90)

for r in results:
    query_preview = (r['query'][:47] + "...") if len(r['query']) > 50 else r['query']
    print(f"{query_preview:<50} | {r['embedding_score']:<12} | {r['similarity']:<10} | {r['llm_score']:<9}")

# --- Optional: Show only incorrect answers ---
print("\n Incorrect answers (Embed Score = 0):")
for r in results:
    if r['embedding_score'] == 0:
        print(f"- {r['query']}")


Query                                              | Embed Score  | Similarity | LLM Score
------------------------------------------------------------------------------------------
Where can I find emergency food in ZIP code 17104? | 0.5          | 0.828      | 0.5      
I lost power in 17104. Where can I go to get ic... | 0            | 0.697      | 0.5      
Is there an emergency shelter near 17104 right ... | 0            | 0.586      | 0.5      
Where can seniors in 17104 get help with medica... | 0            | 0.699      | 0.5      
How can I reach someone for water delivery in Z... | 0            | 0.686      | 0.5      
What food programs are available for children a... | 0.5          | 0.762      | 1        
Where can I get culturally appropriate food aft... | 0            | 0.63       | 0.5      
Is it safe to drink the tap water in 17104 afte... | 0            | 0.64       | 1        
What if I need water for medical equipment but ... | 0            | 0.536      | 0.5      
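
The table can be condensed into a few summary numbers. The sketch below is one possible aggregation, assuming every `llm_score` is a numeric string; the function name `summarize` is hypothetical, and the three sample rows are copied from the first, second, and sixth rows of the table above.

```python
def summarize(results):
    """Aggregate per-question scores into means and an agreement rate."""
    n = len(results)
    if n == 0:
        return {"count": 0}
    embed = [float(r["embedding_score"]) for r in results]
    llm = [float(r["llm_score"]) for r in results]  # assumes numeric strings
    agree = sum(1 for e, l in zip(embed, llm) if e == l)
    return {
        "count": n,
        "mean_embedding_score": sum(embed) / n,
        "mean_llm_score": sum(llm) / n,
        "agreement_rate": agree / n,  # share of rows where both scorers match
    }


sample_rows = [
    {"embedding_score": 0.5, "similarity": 0.828, "llm_score": "0.5"},
    {"embedding_score": 0, "similarity": 0.697, "llm_score": "0.5"},
    {"embedding_score": 0.5, "similarity": 0.762, "llm_score": "1"},
]
summary = summarize(sample_rows)
print(summary)
```

A low agreement rate suggests that the fixed similarity thresholds (0.90 and 0.70) and the LLM judge are measuring different things, which is worth checking before trusting either score alone.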

In [21]:
import json

# --- Load evaluation results ---
with open("evaluation_results.json", "r") as f:
    results = json.load(f)

# --- Print results in a readable table ---
print(f"{'Query':<50} | {'AI Response':<60} | {'Embed Score':<12} | {'Similarity':<10} | {'LLM Score':<9}")
print("-" * 150)

for r in results:
    # Truncate long query and AI response for display
    query_preview = (r['query'][:47] + "...") if len(r['query']) > 50 else r['query']
    ai_preview = (r['ai_answer'][:57] + "...") if len(r['ai_answer']) > 60 else r['ai_answer']

    print(f"{query_preview:<50} | {ai_preview:<60} | {r['embedding_score']:<12} | {r['similarity']:<10} | {r['llm_score']:<9}")

# --- Optional: Show only incorrect answers with details ---
print("\n  Incorrect answers (Embed Score = 0):")
for r in results:
    if r['embedding_score'] == 0:
        print(f"\nQ: {r['query']}")
        print(f"AI: {r['ai_answer']}")
        print(f"TRUE: {r['true_answer']}")


Query                                              | AI Response                                                  | Embed Score  | Similarity | LLM Score
------------------------------------------------------------------------------------------------------------------------------------------------------
Where can I find emergency food in ZIP code 17104? | To find emergency food resources in ZIP code 17104, you c... | 0.5          | 0.828      | 0.5      
I lost power in 17104. Where can I go to get ic... | If you're in the 17104 area and have lost power, here are... | 0            | 0.697      | 0.5      
Is there an emergency shelter near 17104 right ... | I don't have real-time data access to check for emergency... | 0            | 0.586      | 0.5      
Where can seniors in 17104 get help with medica... | Seniors in the 17104 area (Harrisburg, Pennsylvania) can ... | 0            | 0.699      | 0.5      
How can I reach someone for water delivery in Z... | To arrange for water deliv

In [3]:
import json

# --- Load evaluation results ---
with open("evaluation_results.json", "r") as f:
    results = json.load(f)

# --- Print results in a readable table ---
print(f"{'Query':<50} | {'AI Response':<60} | {'Embed Score':<12} | {'Similarity':<10} | {'LLM Score':<9} | {'Error Score':<11}")
print("-" * 170)

for r in results:
    # Truncate long query and AI response for display
    query_preview = (r['query'][:47] + "...") if len(r['query']) > 50 else r['query']
    ai_preview = (r['ai_answer'][:57] + "...") if len(r['ai_answer']) > 60 else r['ai_answer']

    # Ensure LLM score is a float
    try:
        llm_score = float(r.get('llm_score', 0))
    except (TypeError, ValueError):
        llm_score = 0.0

    # Calculate Error Score
    error_score = round(1 - llm_score, 2)

    print(f"{query_preview:<50} | {ai_preview:<60} | {r['embedding_score']:<12} | {r['similarity']:<10} | {llm_score:<9} | {error_score:<11}")

# --- Optional: Show only incorrect answers with details ---
print("\n❌ Incorrect answers (Embed Score = 0):")
for r in results:
    if r['embedding_score'] == 0:
        print(f"\nQ: {r['query']}")
        print(f"AI: {r['ai_answer']}")
        print(f"TRUE: {r['true_answer']}")
        try:
            llm_score = float(r.get('llm_score', 0))
        except (TypeError, ValueError):
            llm_score = 0.0
        print(f"Error Score: {round(1 - llm_score, 2)}")


Query                                              | AI Response                                                  | Embed Score  | Similarity | LLM Score | Error Score
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Where can I find emergency food in ZIP code 17104? | To find emergency food resources in ZIP code 17104, you c... | 0.5          | 0.828      | 0.5       | 0.5        
I lost power in 17104. Where can I go to get ic... | If you're in the 17104 area and have lost power, here are... | 0            | 0.697      | 0.5       | 0.5        
Is there an emergency shelter near 17104 right ... | I don't have real-time data access to check for emergency... | 0            | 0.586      | 0.5       | 0.5        
Where can seniors in 17104 get help with medica... | Seniors in the 17104 area (Harrisburg, Pennsylvania) can ... | 0            | 0.699      | 0.5       | 0