## Evaluating Chunk Sizes in Simple RAG

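Chunk size directly affects retrieval accuracy in a Retrieval-Augmented Generation (RAG) pipeline. Small chunks match a query precisely but may drop the context needed to answer it; large chunks keep that context but can dilute the match with unrelated text. The goal is to balance retrieval performance with response quality.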

This section evaluates different chunk sizes by:

1. Extracting text from a PDF.
2. Splitting text into chunks of varying sizes.
3. Creating embeddings for each chunk.
4. Retrieving relevant chunks for a query.
5. Generating a response using retrieved chunks.
6. Evaluating faithfulness and relevancy.
7. Comparing results for different chunk sizes.

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
import fitz
import os
import numpy as np
import json
from openai import OpenAI

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

In [2]:
# Initialize the OpenAI client with the API key
client = OpenAI(
    api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
)

## Extracting Text from the PDF
First, we will extract text from the `AI_Information.pdf` file.

In [3]:
def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a PDF file.

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    # Open the PDF file at the path passed by the caller
    mypdf = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text
    
    # Iterate through each page in the PDF
    for page in mypdf:
        # Extract text from the current page and add spacing
        all_text += page.get_text("text") + " "

    # Return the extracted text, stripped of leading/trailing whitespace
    return all_text.strip()

# Define the path to the PDF file
pdf_path = "data/AI_Information.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Print the first 500 characters of the extracted text
print(extracted_text[:500])

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f


## Chunking the Extracted Text
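To improve retrieval, we split the extracted text into overlapping chunks of different sizes. We test chunk sizes of 128, 256, and 512 characters, each with an overlap of one fifth of the chunk size (`size // 5`).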

In [4]:
def chunk_text(text, n, overlap):
    """
    Splits text into overlapping chunks.

    Args:
    text (str): The text to be chunked.
    n (int): Number of characters per chunk.
    overlap (int): Overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
    chunks = []  # Initialize an empty list to store the chunks
    for i in range(0, len(text), n - overlap):
        # Append a chunk of text from the current index to the index + chunk size
        chunks.append(text[i:i + n])
    
    return chunks  # Return the list of text chunks

# Define different chunk sizes to evaluate
chunk_sizes = [128, 256, 512]

# Create a dictionary to store text chunks for each chunk size
text_chunks_dict = {size: chunk_text(extracted_text, size, size // 5) for size in chunk_sizes}

# Print the number of chunks created for each chunk size
for size, chunks in text_chunks_dict.items():
    print(f"Chunk Size: {size}, Number of Chunks: {len(chunks)}")

Chunk Size: 128, Number of Chunks: 326
Chunk Size: 256, Number of Chunks: 164
Chunk Size: 512, Number of Chunks: 82
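
To see how the stride `n - overlap` produces overlapping chunks, here is a minimal sketch on a short string. It repeats the same slicing logic under the hypothetical name `chunk_text_demo` and adds a guard for `overlap >= n`; that guard is an assumption added for illustration and is not part of `chunk_text` above.

```python
import math

def chunk_text_demo(text, n, overlap):
    """Split text into chunks of n characters that share `overlap` characters."""
    # Assumption for this sketch: reject overlaps that would make the stride <= 0.
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")
    step = n - overlap
    return [text[i:i + n] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrst"  # 20 characters
demo_chunks = chunk_text_demo(sample, n=10, overlap=2)
for chunk in demo_chunks:
    print(repr(chunk))
# 'abcdefghij'
# 'ijklmnopqr'
# 'qrst'

# The number of chunks equals ceil(len(text) / (n - overlap)).
print(len(demo_chunks) == math.ceil(len(sample) / (10 - 2)))  # True
```

Each chunk starts `n - overlap` characters after the previous one, so the last `overlap` characters of one chunk are repeated at the start of the next, and the final chunk may be shorter than `n`.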


## Creating Embeddings for Text Chunks
Embeddings convert text into numerical representations for similarity search.

In [6]:
from tqdm import tqdm

def create_embeddings(texts, model="text-embedding-3-small"):
    """
    Generates embeddings for a list of texts.

    Args:
    texts (List[str]): List of input texts.
    model (str): Embedding model.

    Returns:
    List[np.ndarray]: List of numerical embeddings.
    """
    # Create embeddings using the specified model
    response = client.embeddings.create(model=model, input=texts)
    # Convert the response to a list of numpy arrays and return
    return [np.array(embedding.embedding) for embedding in response.data]

# Generate embeddings for each chunk size
# Iterate over each chunk size and its corresponding chunks in the text_chunks_dict
chunk_embeddings_dict = {size: create_embeddings(chunks) for size, chunks in tqdm(text_chunks_dict.items(), desc="Generating Embeddings")}

Generating Embeddings: 100%|██████████| 3/3 [00:03<00:00,  1.02s/it]
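
`create_embeddings` sends every chunk for a given size in a single request. For larger documents that may run into per-request limits of the embedding provider, so one possible approach is to embed in batches. The sketch below is an assumption-laden illustration: `embed_in_batches`, `batch_size`, and `fake_embed` are hypothetical names, and the stand-in embedder exists only so the example runs without an API call.

```python
from typing import Callable, List, Sequence

import numpy as np

def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[np.ndarray]],
    batch_size: int = 256,
) -> List[np.ndarray]:
    """Call embed_fn on consecutive slices of texts and concatenate the results."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings: List[np.ndarray] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        embeddings.extend(embed_fn(batch))
    return embeddings

# Stand-in embedder for illustration only: maps each text to a 2-d vector
# (character count, space count). It is NOT a real embedding model.
def fake_embed(batch):
    return [np.array([len(t), t.count(" ")], dtype=float) for t in batch]

vectors = embed_in_batches(["a b", "cd", "e f g"], fake_embed, batch_size=2)
print(len(vectors))  # 3
```

In the notebook, the real call would presumably look like `embed_in_batches(chunks, create_embeddings)`, with a batch size chosen from the provider's documented limits.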


## Performing Semantic Search
We use cosine similarity to find the most relevant text chunks for a user query.

In [7]:
def cosine_similarity(vec1, vec2):
    """
    Computes cosine similarity between two vectors.

    Args:
    vec1 (np.ndarray): First vector.
    vec2 (np.ndarray): Second vector.

    Returns:
    float: Cosine similarity score.
    """

    # Divide the dot product by the product of the two vector norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
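
As a quick sanity check of the formula, the sketch below evaluates it on hand-picked 2-d vectors. The function is copied under the hypothetical name `cosine_similarity_demo` so the example is self-contained; the vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity_demo(vec1, vec2):
    """Cosine of the angle between vec1 and vec2."""
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a
d = np.array([-1.0, 0.0])  # opposite direction to a

print(cosine_similarity_demo(a, b))  # 1.0
print(cosine_similarity_demo(a, c))  # 0.0
print(cosine_similarity_demo(a, d))  # -1.0
```

The score depends only on direction, not on vector length, which is why it suits comparing embeddings of chunks with different amounts of text.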

In [8]:
def retrieve_relevant_chunks(query, text_chunks, chunk_embeddings, k=5):
    """
    Retrieves the top-k most relevant text chunks.
    
    Args:
    query (str): User query.
    text_chunks (List[str]): List of text chunks.
    chunk_embeddings (List[np.ndarray]): Embeddings of text chunks.
    k (int): Number of top chunks to return.
    
    Returns:
    List[str]: Most relevant text chunks.
    """
    # Generate an embedding for the query - pass query as a list and get first item
    query_embedding = create_embeddings([query])[0]
    
    # Calculate cosine similarity between the query embedding and each chunk embedding
    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]
    
    # Get the indices of the top-k most similar chunks
    top_indices = np.argsort(similarities)[-k:][::-1]
    
    # Return the top-k most relevant text chunks
    return [text_chunks[i] for i in top_indices]
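
The top-k selection above can also be written with a single matrix product instead of a Python loop. The following is a minimal sketch under stated assumptions: `top_k_indices` is a hypothetical helper, and the 2-d "embeddings" are toy values invented for illustration rather than output from an embedding model.

```python
import numpy as np

def top_k_indices(query_vec, chunk_matrix, k=5):
    """Return (indices, similarities) for the k rows most similar to query_vec."""
    chunk_matrix = np.asarray(chunk_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Dot products divided by the norms give the cosine similarity per row.
    row_norms = np.linalg.norm(chunk_matrix, axis=1)
    sims = chunk_matrix @ query_vec / (row_norms * np.linalg.norm(query_vec))
    k = max(1, min(k, len(sims)))
    # argsort is ascending, so take the last k and reverse for best-first order.
    return np.argsort(sims)[-k:][::-1], sims

# Toy data, made up for illustration.
chunks = ["about cats", "about dogs", "about cars"]
matrix = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
query_vec = np.array([1.0, 0.1])

indices, sims = top_k_indices(query_vec, matrix, k=2)
print([chunks[i] for i in indices])  # ['about cats', 'about dogs']
```

Clamping `k` to the number of chunks mirrors what the slice `[-k:]` already does implicitly when fewer than `k` chunks exist.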

In [14]:
import os
import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# --- 1) Load your validation data ---
val_path = '/Users/kekunkoya/Desktop/ISEM 770 Class Project/val.json'
if not os.path.isfile(val_path):
    raise FileNotFoundError(f"Could not find validation file at: {val_path!r}")
with open(val_path, 'r', encoding='utf-8') as f:
    data = json.load(f)
query = data[3]['question']

# --- 2) Define your chunk sizes and sample text/embeddings dicts ---
#    Replace these with your actual chunks and embeddings.
chunk_sizes = [128, 256, 512]
text_chunks_dict = {
    size: ["chunk1 for size "+str(size), "chunk2 for size "+str(size)]  # ← REPLACE
    for size in chunk_sizes
}
# here we simulate embeddings as random vectors; replace with your real embeddings
chunk_embeddings_dict = {
    size: np.random.rand(len(text_chunks_dict[size]), 768)
    for size in chunk_sizes
}

# --- 3) Define retrieval function inline ---
def retrieve_relevant_chunks(query: str,
                             chunks: list[str],
                             embeddings: np.ndarray,
                             top_k: int = 5) -> list[str]:
    """
    Return the top_k chunks ranked by cosine similarity to the query.
    The query embedding below is a random stub; replace it with a
    real embedding of `query` before interpreting the results.
    """
    # --- a) Embed the query (stubbed as random) ---
    #    Replace this with your real query embedding call!
    query_emb = np.random.rand(1, embeddings.shape[1])
    
    # --- b) Compute similarities and pick top_k ---
    sims = cosine_similarity(query_emb, embeddings)[0]
    top_idx = np.argsort(sims)[::-1][:top_k]
    return [chunks[i] for i in top_idx]

# --- 4) Run retrieval across sizes ---
retrieved_chunks_dict = {
    size: retrieve_relevant_chunks(
        query,
        text_chunks_dict[size],
        chunk_embeddings_dict[size]
    )
    for size in chunk_sizes
}

# --- 5) Print results for size=256 ---
print("Top chunks for chunk_size=256:")
print(retrieved_chunks_dict.get(256, 'No chunks for 256'))


Top chunks for chunk_size=256:
['chunk1 for size 256', 'chunk2 for size 256']


## Generating a Response Based on Retrieved Chunks
Let's generate a response from the retrieved text for each chunk size and print the one for chunk size `256`.

In [17]:
import os
import json
from openai import OpenAI

# 0) Initialize your OpenAI client
#    Make sure OPENAI_API_KEY is set in your environment
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 1) Define the system prompt
system_prompt = (
    "You are an AI assistant that strictly answers based on the given context. "
    "If the answer cannot be derived directly from the provided context, "
    "respond with: 'I do not have enough information to answer that.'"
)

# 2) Define the response generator
def generate_response(query, system_prompt, retrieved_chunks, model="gpt-4o-mini"):
    """
    Generates an AI response based on retrieved chunks.
    """
    # Combine retrieved chunks into a single context string
    context = "\n\n".join([f"Context {i+1}:\n{chunk}" 
                            for i, chunk in enumerate(retrieved_chunks)])
    
    # Build the messages payload
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": f"{context}\n\nQuestion: {query}"}
    ]
    
    # Call the chat completion endpoint
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=messages
    )
    return response.choices[0].message.content

# 3) (Re)define your query and chunk_sizes if not already in scope
#    query = data[3]['question']
#    chunk_sizes = [128, 256, 512]
#    retrieved_chunks_dict = {...}  # as built previously

# 4) Generate responses for each chunk size
ai_responses_dict = {
    size: generate_response(query, system_prompt, retrieved_chunks_dict[size])
    for size in chunk_sizes
}

# 5) Print the response for chunk size 256
print("AI response (chunk_size=256):")
print(ai_responses_dict.get(256, "No response for size 256"))


AI response (chunk_size=256):
I do not have enough information to answer that.
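
Printing the prompt that the model receives can help explain a reply like this one. The sketch below isolates the message-building step of `generate_response` in a hypothetical helper, `build_messages`, and runs it on the two placeholder chunks retrieved earlier; the question and the shortened system prompt are made-up stand-ins.

```python
def build_messages(query, system_prompt, retrieved_chunks):
    """Assemble the chat messages without calling the API."""
    context = "\n\n".join(
        f"Context {i + 1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{context}\n\nQuestion: {query}"},
    ]

messages = build_messages(
    query="What is AI?",  # made-up question for illustration
    system_prompt="Answer only from the given context.",
    retrieved_chunks=["chunk1 for size 256", "chunk2 for size 256"],
)
print(messages[1]["content"])
# Context 1:
# chunk1 for size 256
#
# Context 2:
# chunk2 for size 256
#
# Question: What is AI?
```

With placeholder chunks like these, the context contains nothing that could answer a question, so a refusal is the behavior the system prompt asks for.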


## Evaluating the AI Response
We score responses on faithfulness and relevancy, using an LLM as the evaluator.

In [18]:
# Define evaluation scoring system constants
SCORE_FULL = 1.0     # Complete match or fully satisfactory
SCORE_PARTIAL = 0.5  # Partial match or somewhat satisfactory
SCORE_NONE = 0.0     # No match or unsatisfactory

In [19]:
# Define strict evaluation prompt templates
FAITHFULNESS_PROMPT_TEMPLATE = """
Evaluate the faithfulness of the AI response compared to the true answer.
User Query: {question}
AI Response: {response}
True Answer: {true_answer}

Faithfulness measures how well the AI response aligns with facts in the true answer, without hallucinations.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely faithful, no contradictions with true answer
    * {partial} = Partially faithful, minor contradictions
    * {none} = Not faithful, major contradictions or hallucinations
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""

In [22]:
RELEVANCY_PROMPT_TEMPLATE = """
Evaluate the relevancy of the AI response to the user query.
User Query: {question}
AI Response: {response}

Relevancy measures how well the response addresses the user's question.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely relevant, directly addresses the query
    * {partial} = Partially relevant, addresses some aspects
    * {none} = Not relevant, fails to address the query
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""
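
The `{question}`, `{response}`, `{full}`, `{partial}`, and `{none}` placeholders are filled with `str.format` before the prompt is sent. As a small illustration, the sketch below renders a shortened stand-in template (`TEMPLATE_DEMO`, a hypothetical name) with made-up values.

```python
SCORE_FULL, SCORE_PARTIAL, SCORE_NONE = 1.0, 0.5, 0.0

# Shortened stand-in for the relevancy template, for illustration only.
TEMPLATE_DEMO = (
    "User Query: {question}\n"
    "AI Response: {response}\n"
    "Score using only {full}, {partial}, or {none}."
)

prompt = TEMPLATE_DEMO.format(
    question="What is AI?",  # made-up example values
    response="AI is a field of computer science.",
    full=SCORE_FULL,
    partial=SCORE_PARTIAL,
    none=SCORE_NONE,
)
print(prompt)
# User Query: What is AI?
# AI Response: AI is a field of computer science.
# Score using only 1.0, 0.5, or 0.0.
```

Because the constants are interpolated as text, float constants and string constants such as `"1.0"` render identically in the prompt.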

In [23]:
import os
import json
from openai import OpenAI

# 0) Initialize OpenAI client (make sure OPENAI_API_KEY is set)
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 1) Load your validation data
val_path = '/Users/kekunkoya/Desktop/ISEM 770 Class Project/val.json'
if not os.path.isfile(val_path):
    raise FileNotFoundError(f"Could not find {val_path!r}")
with open(val_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

query = data[3]['question']
true_answer = data[3]['ideal_answer']

# 2) Define your prompts & scoring constants
SCORE_FULL    = "1.0"
SCORE_PARTIAL = "0.5"
SCORE_NONE    = "0.0"

FAITHFULNESS_PROMPT_TEMPLATE = """
Evaluate the FAITHFULNESS of the assistant’s response given the true answer.
Question: {question}
Response: {response}
True Answer: {true_answer}
Return:
- {full} if the response is fully supported by the true answer.
- {partial} if it’s partially supported.
- {none} if it’s unsupported.
Just output ONLY the numeric score.
"""

RELEVANCY_PROMPT_TEMPLATE = """
Evaluate the RELEVANCY of the assistant’s response to the user’s question.
Question: {question}
Response: {response}
Return:
- {full} if the response directly addresses the question.
- {partial} if it somewhat addresses it.
- {none} if it does not address it.
Just output ONLY the numeric score.
"""

# 3) Define the evaluator
def evaluate_response(question, response, true_answer):
    # Build prompts
    faith_prompt = FAITHFULNESS_PROMPT_TEMPLATE.format(
        question=question,
        response=response,
        true_answer=true_answer,
        full=SCORE_FULL,
        partial=SCORE_PARTIAL,
        none=SCORE_NONE
    )
    rel_prompt = RELEVANCY_PROMPT_TEMPLATE.format(
        question=question,
        response=response,
        full=SCORE_FULL,
        partial=SCORE_PARTIAL,
        none=SCORE_NONE
    )

    # Ask the LLM (use a chat model, not an embedding model)
    faith_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are an objective evaluator. Return ONLY the numeric score."},
            {"role": "user",   "content": faith_prompt}
        ]
    )
    rel_resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are an objective evaluator. Return ONLY the numeric score."},
            {"role": "user",   "content": rel_prompt}
        ]
    )

    # Parse scores; a reply that is not a bare number falls back to 0.0
    try:
        faith_score = float(faith_resp.choices[0].message.content.strip())
    except ValueError:
        faith_score = 0.0
    try:
        rel_score = float(rel_resp.choices[0].message.content.strip())
    except ValueError:
        rel_score = 0.0

    return faith_score, rel_score

# 4) Assume you've already generated `ai_responses_dict` and `chunk_sizes`
#    e.g., ai_responses_dict = {256: "...", 128: "...", ...}

# Evaluate for chunk sizes 256 and 128
faith256, rel256 = evaluate_response(query, ai_responses_dict[256], true_answer)
faith128, rel128 = evaluate_response(query, ai_responses_dict[128], true_answer)

print(f"Faithfulness (256): {faith256}, Relevancy (256): {rel256}")
print(f"Faithfulness (128): {faith128}, Relevancy (128): {rel128}")


Faithfulness (256): 0.0, Relevancy (256): 0.0
Faithfulness (128): 0.0, Relevancy (128): 0.0
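
Two refinements may make the comparison across chunk sizes easier to read. First, the `0.0` fallback in `evaluate_response` makes an unparseable evaluator reply indistinguishable from a genuine zero; returning `None` instead keeps the two apart. Second, the final step (comparing every chunk size) can be a single loop. The sketch below is only an illustration under assumptions: `parse_score`, `compare_chunk_sizes`, and `fake_evaluate` are hypothetical names, and the stand-in evaluator returns synthetic values that are not real results.

```python
import re

ALLOWED_SCORES = (0.0, 0.5, 1.0)

def parse_score(raw, default=None):
    """Extract the first number from an evaluator reply and validate it."""
    match = re.search(r"\d+(?:\.\d+)?", raw or "")
    if match is None:
        return default
    value = float(match.group())
    return value if value in ALLOWED_SCORES else default

def compare_chunk_sizes(responses, evaluate_fn):
    """Return {size: (faithfulness, relevancy)} for every chunk size."""
    return {size: evaluate_fn(text) for size, text in sorted(responses.items())}

print(parse_score("Score: 0.5"))  # 0.5
print(parse_score("n/a"))         # None

# Stand-in evaluator with synthetic output, used only so the sketch runs offline.
def fake_evaluate(text):
    return (float(len(text) > 0), 0.5)

table = compare_chunk_sizes({256: "some reply", 128: "another reply"}, fake_evaluate)
for size, (faith, rel) in table.items():
    print(f"chunk_size={size}: faithfulness={faith}, relevancy={rel}")
```

In the notebook, `evaluate_fn` would presumably wrap the real evaluator, for example `lambda r: evaluate_response(query, r, true_answer)`, applied to `ai_responses_dict`.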
