## Context-Enriched Retrieval in RAG
Retrieval-Augmented Generation (RAG) enhances AI responses by retrieving relevant knowledge from external sources. Traditional retrieval methods return isolated text chunks, which can lead to incomplete answers.

To address this, we introduce Context-Enriched Retrieval, which returns each matched chunk together with its neighboring chunks so that the model sees the surrounding text and can produce a more coherent answer.

Steps in This Notebook:
- Data Ingestion: Extract text from a PDF.
- Chunking with Overlapping Context: Split text into overlapping chunks to preserve context.
- Embedding Creation: Convert text chunks into numerical representations.
- Context-Aware Retrieval: Retrieve relevant chunks along with their neighbors for better completeness.
- Response Generation: Use a language model to generate responses based on retrieved context.
- Evaluation: Assess the model's response accuracy.

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

## Extracting Text from a PDF File
To implement RAG, we first need a source of textual data. In this case, we extract text from a PDF file using the PyMuPDF library.

In [4]:
import fitz  # pip install PyMuPDF

def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file.

    Args:
        pdf_path (str): Path to the PDF file.

    Returns:
        str: Extracted text from the entire PDF.
    """
    # Open the PDF file
    doc = fitz.open(pdf_path)
    all_text = []

    # Iterate through each page in the PDF
    for page in doc:
        all_text.append(page.get_text("text"))

    doc.close()
    return "\n".join(all_text)

# Example usage:
pdf_file = '/Users/kekunkoya/Desktop/ISEM 770 Class Project/Homelessness.pdf'
text = extract_text_from_pdf(pdf_file)
print(text[:500])  # print the first 500 characters to verify


19
Defining and Measuring Homelessness
Volker Busch-Geertsema
GISS, Germany
>> Abstract_ Substantial progress has been made at EU level on defining home-
lessness. The European Typology on Homelessness and Housing Exclusion 
(ETHOS) is widely accepted in almost all European countries (and beyond) as 
a useful conceptual framework and almost everywhere definitions at national 
level (though often not identical with ETHOS) are discussed in relation to this 
typology. The development and some of th


## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

In [5]:
def chunk_text(text, n, overlap):
    """
    Chunks the given text into segments of n characters with overlap.

    Args:
    text (str): The text to be chunked.
    n (int): The number of characters in each chunk.
    overlap (int): The number of overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
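    # Guard against a zero or negative step, which would fail or silently return nothing
    if n <= 0 or not 0 <= overlap < n:
        raise ValueError("n must be positive and overlap must satisfy 0 <= overlap < n")

    chunks = []  # Initialize an empty list to store the chunks

    # Loop through the text with a step size of (n - overlap)
    for i in range(0, len(text), n - overlap):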
        # Append a chunk of text from index i to i + n to the chunks list
        chunks.append(text[i:i + n])

    return chunks  # Return the list of text chunks
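
To see how the overlap behaves, the minimal sketch below applies the same sliding-window idea to a short toy string. The helper name `sliding_chunks` and the sample text are illustrative assumptions and are not part of the notebook.

```python
import math

def sliding_chunks(text: str, n: int, overlap: int) -> list[str]:
    """Same sliding-window idea as chunk_text, shown on a toy input."""
    step = n - overlap  # how far the window advances on each iteration
    return [text[i:i + n] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
chunks = sliding_chunks(sample, n=10, overlap=4)

for chunk in chunks:
    print(repr(chunk))

# Consecutive full-length chunks share `overlap` characters
print(chunks[0][-4:] == chunks[1][:4])

# The number of chunks equals ceil(len(text) / (n - overlap))
print(len(chunks) == math.ceil(len(sample) / (10 - 4)))
```

Note that the last chunks can be shorter than `n`, and the final one may be fully contained in the chunk before it, because the window keeps advancing until its start index passes the end of the text.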

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [6]:
# Initialize the OpenAI client with the API key (the default base URL is used)
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
)

## Extracting and Chunking Text from a PDF File
Now, we load the PDF, extract text, and split it into chunks.

In [7]:
# Define the path to the PDF file
pdf_path = "Homelessness.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Chunk the extracted text into segments of 1000 characters with an overlap of 200 characters
text_chunks = chunk_text(extracted_text, 1000, 200)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text chunk:")
print(text_chunks[0])

Number of text chunks: 65

First text chunk:
19
Defining and Measuring Homelessness
Volker Busch-Geertsema
GISS, Germany
>> Abstract_ Substantial progress has been made at EU level on defining home-
lessness. The European Typology on Homelessness and Housing Exclusion 
(ETHOS) is widely accepted in almost all European countries (and beyond) as 
a useful conceptual framework and almost everywhere definitions at national 
level (though often not identical with ETHOS) are discussed in relation to this 
typology. The development and some of the remaining controversial issues 
concerning ETHOS and a reduced version of it are discussed in this chapter. 
Furthermore essential reasons and different approaches to measure home-
lessness are presented. It is argued that a single number will not be enough 
to understand homelessness and monitor progress in tackling it. More 
research and more work to improve information on homelessness at national 
levels will be needed before we can achieve compa

## Creating Embeddings for Text Chunks
Embeddings transform text into numerical vectors, which allow for efficient similarity search.

In [8]:
def create_embeddings(text, model="text-embedding-3-small"):
    """
    Creates embeddings for the given text using the specified OpenAI model.

    Args:
    text (str or List[str]): The input text, or list of texts, to embed.
    model (str): The model to be used for creating embeddings. Default is "text-embedding-3-small".

    Returns:
    object: The response from the OpenAI API; each vector is available as response.data[i].embedding.
    """
    # Create embeddings for the input text using the specified model
    response = client.embeddings.create(
        model=model,
        input=text
    )

    return response  # Return the response containing the embeddings

# Create embeddings for the text chunks
response = create_embeddings(text_chunks)
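
The notebook reads each vector through the `.embedding` attribute of the items in `response.data`. As a rough sketch of how those vectors could be gathered into one array for later similarity computations, the example below stacks them into a NumPy matrix. The helper `embeddings_to_matrix` is hypothetical, and `SimpleNamespace` objects with made-up values stand in for real API results so the code runs without an API call.

```python
from types import SimpleNamespace

import numpy as np

def embeddings_to_matrix(items) -> np.ndarray:
    """Stack the `.embedding` attribute of each item into a 2-D array (one row per chunk)."""
    return np.array([item.embedding for item in items], dtype=float)

# Stand-in objects that mimic the items in `response.data`; the values are made up.
fake_data = [
    SimpleNamespace(embedding=[0.1, 0.2, 0.3]),
    SimpleNamespace(embedding=[0.4, 0.5, 0.6]),
]

matrix = embeddings_to_matrix(fake_data)
print(matrix.shape)  # (number of chunks, embedding dimension)
```

With the real response, the same call would be `embeddings_to_matrix(response.data)`, assuming every item exposes `.embedding` as it does in the search function of this notebook.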

## Implementing Context-Aware Semantic Search
We modify retrieval to include neighboring chunks for better context.

In [9]:
def cosine_similarity(vec1, vec2):
    """
    Calculates the cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): The first vector.
    vec2 (np.ndarray): The second vector.

    Returns:
    float: The cosine similarity between the two vectors.
    """
    # Compute the dot product of the two vectors and divide by the product of their norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
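
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude. The small sketch below checks this on hand-picked vectors; `cosine_sim` is an illustrative stand-in that mirrors the function above.

```python
import numpy as np

def cosine_sim(a, b) -> float:
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1, 2, 3], [2, 4, 6])  # parallel vectors, close to 1
orthogonal = cosine_sim([1, 0], [0, 1])            # perpendicular vectors, close to 0
opposite = cosine_sim([1, 0], [-1, 0])             # opposite vectors, close to -1

print(same_direction, orthogonal, opposite)
```

A vector of all zeros would make the denominator zero, so inputs are assumed to be non-zero here.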

In [10]:
def context_enriched_search(query, text_chunks, embeddings, k=1, context_size=1):
    """
    Retrieves the most relevant chunk along with its neighboring chunks.

    Args:
    query (str): Search query.
    text_chunks (List[str]): List of text chunks.
    embeddings (list): Chunk embedding objects (e.g. response.data), each exposing an `.embedding` attribute.
    k (int): Number of relevant chunks to retrieve.
    context_size (int): Number of neighboring chunks to include.

    Returns:
    List[str]: Relevant text chunks with contextual information.
    """
    # Convert the query into an embedding vector
    query_embedding = create_embeddings(query).data[0].embedding
    similarity_scores = []

    # Compute similarity scores between query and each text chunk embedding
    for i, chunk_embedding in enumerate(embeddings):
        # Calculate cosine similarity between the query embedding and current chunk embedding
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding.embedding))
        # Store the index and similarity score as a tuple
        similarity_scores.append((i, similarity_score))

    # Sort chunks by similarity score in descending order (highest similarity first)
    similarity_scores.sort(key=lambda x: x[1], reverse=True)

    # Get the indices of the k most relevant chunks
    top_indices = [index for index, _ in similarity_scores[:k]]

    # Collect each top chunk plus its neighbors,
    # ensuring we don't go below 0 or beyond the length of text_chunks
    selected = set()
    for top_index in top_indices:
        start = max(0, top_index - context_size)
        end = min(len(text_chunks), top_index + context_size + 1)
        selected.update(range(start, end))

    # Return the selected chunks in document order, without duplicates
    return [text_chunks[i] for i in sorted(selected)]
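
The neighbor-window logic can be tried offline without calling the embeddings API. The sketch below uses made-up 2-D vectors in place of real embeddings and a hypothetical helper, `top_chunk_with_neighbors`, that scores all chunks at once and then expands the best match by `context_size` chunks on each side.

```python
import numpy as np

def top_chunk_with_neighbors(query_vec, chunk_vecs, chunks, context_size=1):
    """Return the best-matching chunk plus `context_size` neighbors on each side."""
    chunk_vecs = np.asarray(chunk_vecs, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)

    # Cosine similarity of the query against every chunk vector at once
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )

    top_index = int(np.argmax(scores))
    start = max(0, top_index - context_size)               # clamp at the first chunk
    end = min(len(chunks), top_index + context_size + 1)   # clamp at the last chunk
    return chunks[start:end]

chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
chunk_vecs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]  # made-up 2-D "embeddings"

middle_hit = top_chunk_with_neighbors([0, 1], chunk_vecs, chunks)
edge_hit = top_chunk_with_neighbors([1, 0], chunk_vecs, chunks)
print(middle_hit)
print(edge_hit)
```

When the best match sits at the start or end of the document, the window is simply truncated, so fewer than `2 * context_size + 1` chunks come back.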

## Running a Query with Context Retrieval
We now test the context-enriched retrieval.

In [11]:
# Load the validation dataset from a JSON file
with open('/Users/kekunkoya/Desktop/ISEM 770 Class Project/valh.json') as f:
    data = json.load(f)

# Extract the first question from the dataset to use as our query
query = data[0]['question']

# Retrieve the most relevant chunk and its neighboring chunks for context
# Parameters:
# - query: The question we're searching for
# - text_chunks: Our text chunks extracted from the PDF
# - response.data: The embeddings of our text chunks
# - k=1: Return the top match
# - context_size=1: Include 1 chunk before and after the top match for context
top_chunks = context_enriched_search(query, text_chunks, response.data, k=1, context_size=1)

# Print the query for reference
print("Query:", query)
# Print each retrieved chunk with a heading and separator
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")

Query: What is the ETHOS typology?
Context 1:
 homelessness, while people living in insecure and/or inadequate housing 
and/or in social isolation might also be affected by exclusion from one or two domains, 
but their situation is classified under ‘housing exclusion’ rather than ‘homelessness’.
On the basis of this conceptional understanding and to try to grasp the varying 
practices in different EU countries, the ETHOS typology was developed, which 
relates, in its most recent version, thirteen different operational categories and 
twenty-four different living situations to the four conceptional categories: roofless, 
houseless, insecure housing and inadequate housing.4 See Table 1.2.
4	
Apart from documenting progress concerning the measurement of homelessness in different 
EU countries and reporting on the latest available data, the forth and fifth reviews of statistics 
(Edgar and Meert, 2005, 2006) focused on developing and refining the ETHOS definition and 
considering the measu

## Generating a Response Using Retrieved Context
We now pass the retrieved context and the query to an LLM to generate a response.

In [12]:
# Define the system prompt for the AI assistant
system_prompt = "You are an AI assistant that strictly answers based on the given context. If the answer cannot be derived directly from the provided context, respond with: 'I do not have enough information to answer that.'"

def generate_response(system_prompt, user_message, model="gpt-4o-mini"):
    """
    Generates a response from the AI model based on the system prompt and user message.

    Args:
    system_prompt (str): The system prompt that guides the assistant's behavior.
    user_message (str): The user's message or query.
    model (str): The model to be used for generating the response. Default is "gpt-4o-mini".

    Returns:
    object: The chat completion returned by the API; the text is in response.choices[0].message.content.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response

# Create the user prompt based on the top chunks
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}\n=====================================\n" for i, chunk in enumerate(top_chunks)])
user_prompt = f"{user_prompt}\nQuestion: {query}"

# Generate AI response
ai_response = generate_response(system_prompt, user_prompt)
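
The prompt assembly above can be wrapped in a small function so it is easy to inspect before sending it to the model. This is only a sketch: `build_user_prompt`, its default separator, and the sample passages are illustrative assumptions, not part of the notebook.

```python
def build_user_prompt(chunks, question, separator="=" * 20):
    """Number each context chunk, delimit it, and append the question at the end."""
    parts = [
        f"Context {i + 1}:\n{chunk}\n{separator}\n"
        for i, chunk in enumerate(chunks)
    ]
    return "\n".join(parts) + f"\nQuestion: {question}"

prompt = build_user_prompt(
    ["First passage.", "Second passage."],  # placeholder chunks
    "What is the ETHOS typology?",
)
print(prompt)
```

Printing the prompt is a quick way to confirm that every retrieved chunk is present and that the question comes last.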

## Evaluating the AI Response
We use the language model as an automated judge: it compares the AI response with the expected answer and assigns a score of 0, 0.5, or 1.

In [13]:
# Define the system prompt for the evaluation system
evaluate_system_prompt = "You are an intelligent evaluation system tasked with assessing the AI assistant's responses. If the AI assistant's response is very close to the true response, assign a score of 1. If the response is incorrect or unsatisfactory in relation to the true response, assign a score of 0. If the response is partially aligned with the true response, assign a score of 0.5."

# Create the evaluation prompt by combining the user query, AI response, true response, and evaluation system prompt
evaluation_prompt = f"User Query: {query}\nAI Response:\n{ai_response.choices[0].message.content}\nTrue Response: {data[0]['ideal_answer']}\n{evaluate_system_prompt}"

# Generate the evaluation response using the evaluation system prompt and evaluation prompt
evaluation_response = generate_response(evaluate_system_prompt, evaluation_prompt)

# Print the evaluation response
print(evaluation_response.choices[0].message.content)

Score: 1
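
The judge replies in free text, so to aggregate results over several questions the numeric score has to be pulled out of that text. The sketch below assumes the reply contains the word "Score" followed by the number, as in the output above; `parse_score` is a hypothetical helper, and replies in any other format return `None`.

```python
import re
from typing import Optional

def parse_score(evaluation_text: str) -> Optional[float]:
    """Extract a 0, 0.5, or 1 score from text such as 'Score: 1'; None if not found."""
    match = re.search(
        r"score\s*:?\s*(\d+(?:\.\d+)?)", evaluation_text, flags=re.IGNORECASE
    )
    if not match:
        return None
    value = float(match.group(1))
    # Accept only the three values the evaluation prompt allows
    return value if value in (0.0, 0.5, 1.0) else None

print(parse_score("Score: 1"))
print(parse_score("Score: 0.5"))
print(parse_score("The answer looks fine."))
```

Returning `None` for unparseable replies, instead of guessing a value, keeps malformed judge output from silently skewing an average.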
