## Evaluating Chunk Sizes in Simple RAG

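Chunk size strongly affects retrieval accuracy in a Retrieval-Augmented Generation (RAG) pipeline. Small chunks match a query precisely but can strip away the context needed to answer it; large chunks keep that context but dilute the embedding with unrelated text. The goal is to find a size that balances retrieval precision with response quality.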

This section evaluates different chunk sizes by:

1. Extracting text from a PDF.
2. Splitting text into chunks of varying sizes.
3. Creating embeddings for each chunk.
4. Retrieving relevant chunks for a query.
5. Generating a response using retrieved chunks.
6. Evaluating faithfulness and relevancy.
7. Comparing results for different chunk sizes.

## Setting Up the Environment
We begin by importing necessary libraries.

In [1]:
import fitz # PyMuPDF
import os
import numpy as np
import json
import google.generativeai as genai

## Setting Up the Gemini API Client
We initialize the Gemini client (`google.generativeai`) to generate embeddings and responses.

In [3]:

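import os
import google.generativeai as genai
from dotenv import load_dotenv

# Load GOOGLE_API_KEY from a local .env file (if present) and configure the client
load_dotenv()
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))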


## Extracting Text from the PDF
First, we will extract text from the `AI_Information.pdf` file.

In [5]:


# --- 2. PDF Text Extraction Function ---
def extract_text_from_pdf(pdf_path: str) -> str:
    """
    Extracts text from a PDF file using PyMuPDF (fitz).

    Args:
    pdf_path (str): Path to the PDF file.

    Returns:
    str: Extracted text from the PDF.
    """
    all_text = ""
    try:
        # Open the PDF file using the provided path
        with fitz.open(pdf_path) as mypdf:
            # Iterate through each page in the PDF
            for page in mypdf:
                # Extract text from the current page
                all_text += page.get_text("text") + " "
    except Exception as e:
        print(f"Error reading PDF: {e}")
        return ""

    # Return the extracted text, stripped of leading/trailing whitespace
    return all_text.strip()

# --- 3. Main Logic ---
if __name__ == "__main__":
    # Define the path to the PDF file
    pdf_path = "/Users/kekunkoya/Desktop/770 Google /AI_Information.pdf"
    
    # Check if the file exists before trying to extract text
    if not os.path.exists(pdf_path):
        print(f"Error: PDF file not found at '{pdf_path}'")
        raise SystemExit(1)

    # Extract text from the PDF file
    extracted_text = extract_text_from_pdf(pdf_path)

    if extracted_text:
        # Print the first 500 characters of the extracted text
        print("Successfully extracted text from the PDF.")
        print("First 500 characters:\n")
        print(extracted_text[:500])
    else:
        print("Failed to extract text.")

    

Successfully extracted text from the PDF.
First 500 characters:

Understanding Artificial Intelligence 
Chapter 1: Introduction to Artificial Intelligence 
Artificial intelligence (AI) refers to the ability of a digital computer or computer-controlled robot 
to perform tasks commonly associated with intelligent beings. The term is frequently applied to 
the project of developing systems endowed with the intellectual processes characteristic of 
humans, such as the ability to reason, discover meaning, generalize, or learn from past 
experience. Over the past f
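
The sample output shows that the extracted text keeps the PDF's visual line breaks, so sentences are split mid-line. Because the chunk sizes used later are counted in characters, one option is to collapse whitespace before chunking. The following is a minimal sketch, assuming that losing paragraph breaks is acceptable for this document; `normalize_whitespace` is a hypothetical helper and not part of the original notebook.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

# A fragment shaped like the extraction output above, with a mid-sentence line break
sample = "digital computer or computer-controlled robot \nto perform tasks"
print(normalize_whitespace(sample))
```

If paragraph boundaries matter for retrieval, a gentler variant would replace only single newlines and keep blank lines.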


## Chunking the Extracted Text
To improve retrieval, we split the extracted text into overlapping chunks of different sizes.

In [6]:
import os

# --- 1. Text Chunking Function ---
def chunk_text(text, n, overlap):
    """
    Splits text into overlapping chunks.

    Args:
    text (str): The text to be chunked.
    n (int): Number of characters per chunk.
    overlap (int): Overlapping characters between chunks.

    Returns:
    List[str]: A list of text chunks.
    """
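    # Guard against a zero or negative step, which would make range() fail
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")

    chunks = []
    # The step size for the loop is the chunk size minus the overlap
    step_size = n - overlap
    # Iterate through the text with the specified step size
    for i in range(0, len(text), step_size):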
        # Append a chunk of text from the current index to the index + chunk size
        chunks.append(text[i:i + n])
    
    return chunks

# --- 2. Main Logic ---
if __name__ == "__main__":
    # Simulate extracted text from a PDF or other source
    extracted_text = """
    Artificial intelligence (AI) is intelligence demonstrated by machines, unlike the natural
    intelligence displayed by humans and animals. AI research has been defined as the field of study
    of intelligent agents, which refers to any device that perceives its environment and takes
    actions that maximize its chance of successfully achieving its goals. The term "artificial
    intelligence" had previously been used to describe machines that mimic and display "human"
    cognitive skills that are associated with the human mind, such as "learning" and "problem-solving".
    """

    # Define different chunk sizes to evaluate
    chunk_sizes = [128, 256, 512]

    # Create a dictionary to store text chunks for each chunk size
    # Overlap is set to 20% of the chunk size
    text_chunks_dict = {size: chunk_text(extracted_text, size, size // 5) for size in chunk_sizes}

    # Print the number of chunks created for each chunk size
    for size, chunks in text_chunks_dict.items():
        print(f"Chunk Size: {size}, Number of Chunks: {len(chunks)}")

Chunk Size: 128, Number of Chunks: 6
Chunk Size: 256, Number of Chunks: 3
Chunk Size: 512, Number of Chunks: 2
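
The chunk counts follow directly from the stride: the loop starts a new chunk every `n - overlap` characters, so a text of length `L` yields `ceil(L / (n - overlap))` chunks, and the last chunk may be shorter than `n`. The sketch below checks that arithmetic for a hypothetical 600-character text (the length is an assumption chosen for illustration, not the measured length of the sample above); `expected_chunk_count` is a hypothetical helper.

```python
import math

def expected_chunk_count(text_len: int, n: int, overlap: int) -> int:
    """Number of chunks produced when stepping through text_len characters with stride n - overlap."""
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")
    return math.ceil(text_len / (n - overlap))

# Hypothetical 600-character text with 20% overlap, as in the cell above
for size in (128, 256, 512):
    print(size, expected_chunk_count(600, size, size // 5))
# prints:
# 128 6
# 256 3
# 512 2
```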


## Creating Embeddings for Text Chunks
Embeddings convert text into numerical representations for similarity search.

In [7]:
import numpy as np
import google.generativeai as genai
import os
from typing import List
from tqdm import tqdm

# --- 1. Gemini API Configuration ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Define the create_embeddings function for Gemini ---
def create_embeddings(texts: List[str], model: str = "models/embedding-001") -> List[np.ndarray]:
    """
    Generates embeddings for a list of texts using the Gemini API.

    Args:
    texts (List[str]): List of input texts.
    model (str): Embedding model.

    Returns:
    List[np.ndarray]: List of numerical embeddings.
    """
    try:
        # Create embeddings for a list of texts
        response = genai.embed_content(model=model, content=texts)
        # For a list input, response['embedding'] holds one embedding per text
        return [np.array(emb) for emb in response['embedding']]
    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return []

# --- 3. Main Logic (Re-implemented for a runnable example) ---
if __name__ == "__main__":
    # Simulate a dictionary of text chunks
    # This would come from a previous text chunking step
    text_chunks_dict = {
        128: ["This is chunk one.", "This is chunk two.", "This is chunk three."],
        256: ["This is a longer chunk one.", "This is a longer chunk two."]
    }

    # Generate embeddings for each chunk size
    # The tqdm progress bar will work as before
    chunk_embeddings_dict = {}
    for size, chunks in tqdm(text_chunks_dict.items(), desc="Generating Embeddings"):
        chunk_embeddings_dict[size] = create_embeddings(chunks)

    # Print the shape of the first embedding for verification
    for size, embeddings in chunk_embeddings_dict.items():
        if embeddings:
            print(f"\nChunk Size: {size}, Number of Embeddings: {len(embeddings)}")
            print(f"First embedding shape: {embeddings[0].shape}")
        else:
            print(f"\nChunk Size: {size}, Failed to generate embeddings.")

Generating Embeddings: 100%|██████████| 2/2 [00:00<00:00,  2.19it/s]


Chunk Size: 128, Number of Embeddings: 3
First embedding shape: (768,)

Chunk Size: 256, Number of Embeddings: 2
First embedding shape: (768,)





## Performing Semantic Search
We use cosine similarity to find the most relevant text chunks for a user query.

In [8]:
import numpy as np
import google.generativeai as genai
import os
from typing import List

# --- 1. Gemini API Configuration ---
# Set your GOOGLE_API_KEY as an environment variable
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 2. Helper function to get embeddings from Gemini ---
def get_embedding(text: str, model: str = "models/embedding-001") -> np.ndarray:
    """
    Creates an embedding for a given text using the Gemini API.
    """
    try:
        response = genai.embed_content(model=model, content=text)
        return np.array(response['embedding'], dtype=np.float32)
    except Exception as e:
        print(f"An error occurred: {e}")
        return np.array([])

# --- 3. Your original cosine_similarity function ---
def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """
    Computes cosine similarity between two vectors.
    """
    dot_product = np.dot(vec1, vec2)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    # Handle the case where a norm is zero to prevent division by zero
    return dot_product / norm_product if norm_product != 0 else 0.0

# --- 4. Main Logic ---
if __name__ == "__main__":
    # Define two sentences with similar meaning
    sentence1 = "The cat sat on the mat."
    sentence2 = "A feline rested on the rug."
    
    # Define a sentence with a different meaning
    sentence3 = "The car drove on the highway."

    print("Generating embeddings with Gemini...")
    vec1 = get_embedding(sentence1)
    vec2 = get_embedding(sentence2)
    vec3 = get_embedding(sentence3)

    if vec1.size > 0 and vec2.size > 0 and vec3.size > 0:
        # Calculate and print the cosine similarity between the similar sentences
        print(f"\nComparing '{sentence1}' and '{sentence2}'...")
        similarity_1_2 = cosine_similarity(vec1, vec2)
        print(f"Cosine Similarity: {similarity_1_2:.4f}")

        # Calculate and print the cosine similarity between the dissimilar sentences
        print(f"\nComparing '{sentence1}' and '{sentence3}'...")
        similarity_1_3 = cosine_similarity(vec1, vec3)
        print(f"Cosine Similarity: {similarity_1_3:.4f}")

        # Expected output: similarity_1_2 should be clearly higher than similarity_1_3.
        # Absolute values depend on the model; unrelated sentences can still score well above 0.
    else:
        print("\nFailed to generate embeddings.")

Generating embeddings with Gemini...

Comparing 'The cat sat on the mat.' and 'A feline rested on the rug.'...
Cosine Similarity: 0.8778

Comparing 'The cat sat on the mat.' and 'The car drove on the highway.'...
Cosine Similarity: 0.6581
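
Cosine similarity is the dot product of two vectors divided by the product of their norms. When a query has to be compared against many chunk embeddings, the same formula can be applied to a whole matrix at once instead of looping. A minimal sketch with hand-made 2-D vectors (the values and the name `cosine_similarity_matrix` are illustrative assumptions, not real embeddings or part of the notebook):

```python
import numpy as np

def cosine_similarity_matrix(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each row of matrix."""
    denom = np.linalg.norm(query) * np.linalg.norm(matrix, axis=1)
    sims = np.zeros(matrix.shape[0], dtype=np.float64)
    nonzero = denom != 0  # rows with zero norm keep a similarity of 0.0
    sims[nonzero] = (matrix[nonzero] @ query) / denom[nonzero]
    return sims

query = np.array([1.0, 0.0])
rows = np.array([[2.0, 0.0],   # same direction as the query
                 [0.0, 3.0],   # orthogonal to the query
                 [1.0, 1.0],   # 45 degrees from the query
                 [0.0, 0.0]])  # zero vector
print(np.round(cosine_similarity_matrix(query, rows), 4))
```

Only the ranking of these values matters for retrieval, which is why the 0.6581 above still counts as the "less similar" result.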


In [9]:


from typing import List, Union

# --- 2. Define helper functions for Gemini ---
def get_embedding(text: Union[str, List[str]], model: str = "models/embedding-001") -> np.ndarray:
    """
    Creates embeddings for a given text or list of texts using the Gemini API.

    Args:
        text (str or List[str]): The text(s) to embed.
        model (str): The Gemini embedding model to use.

    Returns:
        np.ndarray: The embedding vector(s) as a NumPy array.
    """
    try:
        response = genai.embed_content(model=model, content=text)
        # response['embedding'] is one vector for a str input and a list of
        # vectors for a list input; np.array handles both shapes
        return np.array(response['embedding'], dtype=np.float32)
    except Exception as e:
        print(f"An error occurred during embedding: {e}")
        return np.array([])

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """Computes cosine similarity between two NumPy vectors."""
    dot_product = np.dot(vec1, vec2)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    return dot_product / norm_product if norm_product != 0 else 0.0

# --- 3. Your core retrieval function, adapted for Gemini ---
def retrieve_relevant_chunks(query: str, text_chunks: List[str], chunk_embeddings: List[np.ndarray], k: int = 5) -> List[str]:
    """
    Retrieves the top-k most relevant text chunks.

    Args:
    query (str): User query.
    text_chunks (List[str]): List of text chunks.
    chunk_embeddings (List[np.ndarray]): Embeddings of text chunks.
    k (int): Number of top chunks to return.

    Returns:
    List[str]: Most relevant text chunks.
    """
    # Generate an embedding for the query using the Gemini-specific helper function
    query_embedding = get_embedding(query)

    if query_embedding.size == 0:
        return []

    # Calculate cosine similarity between the query embedding and each chunk embedding
    similarities = [cosine_similarity(query_embedding, emb) for emb in chunk_embeddings]
    
    # Get the indices of the top-k most similar chunks
    top_indices = np.argsort(similarities)[-k:][::-1]
    
    # Return the top-k most relevant text chunks
    return [text_chunks[i] for i in top_indices]

# --- 4. Main Logic Example ---
if __name__ == "__main__":
    # Simulate text chunks and their embeddings
    text_chunks = [
        "Artificial intelligence is a branch of computer science that deals with creating intelligent machines.",
        "Quantum computing harnesses quantum phenomena to perform computations beyond classical computers.",
        "Machine learning is a subfield of AI that provides systems the ability to automatically learn and improve from experience.",
        "The sun is the star at the center of the Solar System.",
        "Robotics is an interdisciplinary field that integrates computer science and engineering."
    ]
    
    print("Creating chunk embeddings...")
    chunk_embeddings = get_embedding(text_chunks)
    
    if chunk_embeddings.size > 0:
        user_query = "What is the relationship between machine learning and AI?"
        print(f"\nSearching for chunks relevant to: '{user_query}'...")
        
        top_results = retrieve_relevant_chunks(user_query, text_chunks, chunk_embeddings, k=2)
        
        print("\nTop 2 most relevant chunks:")
        for i, result in enumerate(top_results):
            print(f"[{i+1}] {result}")
    else:
        print("Failed to create embeddings, cannot perform search.")

Creating chunk embeddings...

Searching for chunks relevant to: 'What is the relationship between machine learning and AI?'...

Top 2 most relevant chunks:
[1] Machine learning is a subfield of AI that provides systems the ability to automatically learn and improve from experience.
[2] Artificial intelligence is a branch of computer science that deals with creating intelligent machines.
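
When comparing chunk sizes it helps to see the similarity scores, not only the chunk text. The sketch below returns `(index, score)` pairs and clamps `k` to the valid range; note that with `k=0` the slice `[-k:]` used in `retrieve_relevant_chunks` selects every index rather than none. `top_k_with_scores` is a hypothetical helper, and the scores are made-up values for illustration.

```python
import numpy as np
from typing import List, Tuple

def top_k_with_scores(similarities: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k highest scores, best first."""
    scores = np.asarray(similarities, dtype=np.float64)
    k = max(0, min(k, scores.size))  # clamp k to [0, number of chunks]
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

print(top_k_with_scores([0.61, 0.88, 0.42, 0.79], k=2))
# prints: [(1, 0.88), (3, 0.79)]
```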


In [10]:
import os
import json
import numpy as np
# Note: this import shadows the cosine_similarity helper defined in the previous cell
from sklearn.metrics.pairwise import cosine_similarity

# --- 1) Load your validation data ---
val_path = '/Users/kekunkoya/Desktop/ISEM 770 Class Project/val.json'
if not os.path.isfile(val_path):
    raise FileNotFoundError(f"Could not find validation file at: {val_path!r}")
with open(val_path, 'r', encoding='utf-8') as f:
    data = json.load(f)
query = data[3]['question']

# --- 2) Define your chunk sizes and sample text/embeddings dicts ---
#    Replace these with your actual chunks and embeddings.
chunk_sizes = [128, 256, 512]
text_chunks_dict = {
    size: ["chunk1 for size "+str(size), "chunk2 for size "+str(size)]  # ← REPLACE
    for size in chunk_sizes
}
# here we simulate embeddings as random vectors; replace with your real embeddings
chunk_embeddings_dict = {
    size: np.random.rand(len(text_chunks_dict[size]), 768)
    for size in chunk_sizes
}

# --- 3) Define retrieval function inline ---
def retrieve_relevant_chunks(query: str,
                             chunks: list[str],
                             embeddings: np.ndarray,
                             top_k: int = 5) -> list[str]:
    """
    Retrieves the top_k chunks by cosine similarity. The query embedding is
    stubbed with a random vector below, so the ranking is arbitrary until you
    replace it with a real embedding call.
    """
    # --- a) Embed the query (stubbed as random) ---
    #    Replace this with your real query embedding call!
    query_emb = np.random.rand(1, embeddings.shape[1])
    
    # --- b) Compute similarities and pick top_k ---
    sims = cosine_similarity(query_emb, embeddings)[0]
    top_idx = np.argsort(sims)[::-1][:top_k]
    return [chunks[i] for i in top_idx]

# --- 4) Run retrieval across sizes ---
retrieved_chunks_dict = {
    size: retrieve_relevant_chunks(
        query,
        text_chunks_dict[size],
        chunk_embeddings_dict[size]
    )
    for size in chunk_sizes
}

# --- 5) Print results for size=256 ---
print("Top chunks for chunk_size=256:")
print(retrieved_chunks_dict.get(256, 'No chunks for 256'))


Top chunks for chunk_size=256:
['chunk1 for size 256', 'chunk2 for size 256']
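
Because the cell above stubs both the chunk and query embeddings with random vectors, the returned order carries no meaning. For offline testing without an API key, a deterministic stand-in can make the retrieval logic checkable. The sketch below is a hypothetical hashed bag-of-words vector (`toy_embed` and its `dim` parameter are assumptions); it only captures word overlap and is not a substitute for a real embedding model.

```python
import zlib
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Deterministic bag-of-words vector: each token increments one hashed bucket."""
    vec = np.zeros(dim, dtype=np.float32)
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        if token:
            # crc32 is stable across runs, unlike Python's built-in hash() for str
            bucket = zlib.crc32(token.encode("utf-8")) % dim
            vec[bucket] += 1.0
    return vec

chunks = ["Machine learning is a subfield of AI.", "The sun is a star."]
embeddings = np.stack([toy_embed(c) for c in chunks])  # shape: (number of chunks, dim)
query_emb = toy_embed("What is machine learning?")     # shape: (dim,)
print(embeddings.shape, query_emb.shape)
```

To use it with the scikit-learn `cosine_similarity` call above, the query vector would need to be reshaped to two dimensions with `query_emb.reshape(1, -1)`.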


## Generating a Response Based on Retrieved Chunks
Let's generate a response based on the retrieved text for chunk size `256`.

In [11]:


# 1) Define the system prompt
system_prompt = (
    "You are an AI assistant that strictly answers based on the given context. "
    "If the answer cannot be derived directly from the provided context, "
    "respond with: 'I do not have enough information to answer that.'"
)

# 2) Define the response generator for Gemini
def generate_response(query: str, system_prompt: str, retrieved_chunks: List[str], model: str = "gemini-1.5-flash") -> str:
    """
    Generates an AI response based on retrieved chunks using the Gemini API.
    """
    # Combine retrieved chunks into a single context string
    context = "\n\n".join([f"Context {i+1}:\n{chunk}" 
                            for i, chunk in enumerate(retrieved_chunks)])
    
    # Build the full user prompt, including the context and the question
    user_prompt = f"{context}\n\nQuestion: {query}"
    
    # Create the model instance with the system prompt
    try:
        gemini_model = genai.GenerativeModel(model, system_instruction=system_prompt)
        # Call the generate_content endpoint
        response = gemini_model.generate_content(user_prompt)
        return response.text
    except Exception as e:
        print(f"An error occurred during response generation: {e}")
        return "Error: Could not generate a response."

# 3) (Re)define your query and retrieved_chunks_dict
#    For this example, we'll use simulated data.
query = "What are the key attributes of a helpful AI assistant?"
chunk_sizes = [128, 256, 512]
retrieved_chunks_dict = {
    128: ["A helpful assistant is responsive and polite.", "It provides concise and relevant answers."],
    256: ["A helpful assistant is designed to provide information in a clear and organized manner, demonstrating attributes like being concise, relevant, and accurate based on provided data."],
    512: ["A helpful assistant is built on a foundation of ethical principles, ensuring it is fair, unbiased, and respectful. It processes information efficiently and can adapt its communication style to the user's needs. The core attributes include providing concise, relevant, and accurate answers directly from the given context."]
}

# 4) Generate responses for each chunk size
ai_responses_dict = {
    size: generate_response(query, system_prompt, retrieved_chunks_dict[size])
    for size in chunk_sizes
}

# 5) Print the response for chunk size 256
print("AI response (chunk_size=256):")
print(ai_responses_dict.get(256, "No response for size 256"))

AI response (chunk_size=256):
Based on the provided context, the key attributes of a helpful AI assistant are being concise, relevant, and accurate.
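
To inspect exactly what the model receives, the prompt assembly inside `generate_response` can be pulled out into a pure function that needs no API call. This is a sketch of that refactor under the same formatting as the cell above; `build_user_prompt` is a hypothetical name, and the query and chunks are placeholder values.

```python
from typing import List

def build_user_prompt(query: str, retrieved_chunks: List[str]) -> str:
    """Format retrieved chunks as numbered context blocks followed by the question."""
    context = "\n\n".join(
        f"Context {i + 1}:\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return f"{context}\n\nQuestion: {query}"

print(build_user_prompt("What is AI?", ["First chunk.", "Second chunk."]))
```

Printing the prompt for each chunk size makes it easy to see how much context each configuration actually supplies.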



## Evaluating the AI Response
We score each response for faithfulness and relevancy, using a powerful LLM as the judge.

In [12]:
# Define evaluation scoring system constants
SCORE_FULL = 1.0     # Complete match or fully satisfactory
SCORE_PARTIAL = 0.5  # Partial match or somewhat satisfactory
SCORE_NONE = 0.0     # No match or unsatisfactory
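
The judge is expected to reply with one of these three values, but its reply arrives as free text. The evaluator later in this section converts the reply with `float(...)`, which raises on a reply such as `Score: 0.5`. A more tolerant parser is sketched below; `parse_score` and `ALLOWED_SCORES` are hypothetical names, and falling back to `0.0` on unparseable replies is an assumption that mirrors the evaluator's error handling.

```python
import re

ALLOWED_SCORES = (0.0, 0.5, 1.0)

def parse_score(raw: str, default: float = 0.0) -> float:
    """Take the first number in the judge's reply; accept it only if it is an allowed score."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        return default
    value = float(match.group())
    return value if value in ALLOWED_SCORES else default

print(parse_score("1.0"))         # prints: 1.0
print(parse_score("Score: 0.5"))  # prints: 0.5
print(parse_score("unsure"))      # prints: 0.0
```

Because only the first number is read, a reply that restates the scale before giving its verdict would be misread, so the prompt should still ask for the bare score.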

In [13]:
# Define strict evaluation prompt templates
FAITHFULNESS_PROMPT_TEMPLATE = """
Evaluate the faithfulness of the AI response compared to the true answer.
User Query: {question}
AI Response: {response}
True Answer: {true_answer}

Faithfulness measures how well the AI response aligns with facts in the true answer, without hallucinations.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely faithful, no contradictions with true answer
    * {partial} = Partially faithful, minor contradictions
    * {none} = Not faithful, major contradictions or hallucinations
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""

In [14]:
RELEVANCY_PROMPT_TEMPLATE = """
Evaluate the relevancy of the AI response to the user query.
User Query: {question}
AI Response: {response}

Relevancy measures how well the response addresses the user's question.

INSTRUCTIONS:
- Score STRICTLY using only these values:
    * {full} = Completely relevant, directly addresses the query
    * {partial} = Partially relevant, addresses some aspects
    * {none} = Not relevant, fails to address the query
- Return ONLY the numerical score ({full}, {partial}, or {none}) with no explanation or additional text.
"""

In [15]:
import os
import json
import google.generativeai as genai
from typing import Tuple

# --- 0) Initialize Gemini client (make sure GOOGLE_API_KEY is set) ---
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise ValueError("GOOGLE_API_KEY environment variable is not set.")
genai.configure(api_key=GOOGLE_API_KEY)

# --- 1) Load your validation data ---
val_path = '/Users/kekunkoya/Desktop/ISEM 770 Class Project/val.json'
if not os.path.isfile(val_path):
    raise FileNotFoundError(f"Could not find {val_path!r}")
with open(val_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Use the fourth validation example (index 3) as the query and reference answer
query = data[3]['question']
true_answer = data[3]['ideal_answer']

# 2) Define your prompts & scoring constants (these redefine the versions from the cells above)
SCORE_FULL = "1.0"
SCORE_PARTIAL = "0.5"
SCORE_NONE = "0.0"

FAITHFULNESS_PROMPT_TEMPLATE = """
Evaluate the FAITHFULNESS of the assistant’s response given the true answer.
Question: {question}
Response: {response}
True Answer: {true_answer}
Return:
- {full} if the response is fully supported by the true answer.
- {partial} if it’s partially supported.
- {none} if it’s unsupported.
Just output ONLY the numeric score.
"""

RELEVANCY_PROMPT_TEMPLATE = """
Evaluate the RELEVANCY of the assistant’s response to the user’s question.
Question: {question}
Response: {response}
Return:
- {full} if the response directly addresses the question.
- {partial} if it somewhat addresses it.
- {none} if it does not address it.
Just output ONLY the numeric score.
"""

# 3) Define the evaluator
def evaluate_response(question: str, response: str, true_answer: str) -> Tuple[float, float]:
    """
    Evaluates a response using the Gemini API for faithfulness and relevancy.
    """
    # Fill in the prompt templates. The evaluator's system instruction is not part of
    # these prompts; it is passed separately when the model is created below.
    faith_prompt = FAITHFULNESS_PROMPT_TEMPLATE.format(
        question=question,
        response=response,
        true_answer=true_answer,
        full=SCORE_FULL,
        partial=SCORE_PARTIAL,
        none=SCORE_NONE
    )
    rel_prompt = RELEVANCY_PROMPT_TEMPLATE.format(
        question=question,
        response=response,
        full=SCORE_FULL,
        partial=SCORE_PARTIAL,
        none=SCORE_NONE
    )
    
    # Use a single Gemini model for both evaluations
    eval_model = genai.GenerativeModel('gemini-1.5-flash', system_instruction="You are an objective evaluator. Return ONLY the numeric score.")

    # Ask the LLM for faithfulness score
    try:
        faith_resp = eval_model.generate_content(faith_prompt)
        faith_score = float(faith_resp.text.strip())
    except Exception as e:  # covers API errors and float() parsing failures
        print(f"Error parsing faithfulness score: {e}")
        faith_score = 0.0

    # Ask the LLM for relevancy score
    try:
        rel_resp = eval_model.generate_content(rel_prompt)
        rel_score = float(rel_resp.text.strip())
    except Exception as e:  # covers API errors and float() parsing failures
        print(f"Error parsing relevancy score: {e}")
        rel_score = 0.0

    return faith_score, rel_score

# 4) Simulate `ai_responses_dict`
ai_responses_dict = {
    256: "The true answer is provided by the validation data for the question.",
    128: "The answer is partially supported by the true answer."
}
chunk_sizes = [128, 256]

# 5) Evaluate for chunk sizes 256 and 128
faith256, rel256 = evaluate_response(query, ai_responses_dict[256], true_answer)
faith128, rel128 = evaluate_response(query, ai_responses_dict[128], true_answer)

print(f"Faithfulness (256): {faith256}, Relevancy (256): {rel256}")
print(f"Faithfulness (128): {faith128}, Relevancy (128): {rel128}")

Faithfulness (256): 1.0, Relevancy (256): 0.0
Faithfulness (128): 0.5, Relevancy (128): 0.0
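
The final step listed at the start of this section, comparing results across chunk sizes, can be sketched as a small ranking helper. It uses the scores printed above; weighting faithfulness and relevancy equally is an assumption, and `rank_chunk_sizes` is a hypothetical name. A single query with simulated responses is far too small a sample to draw conclusions from, so in practice the scores would be averaged over many validation questions first.

```python
from typing import Dict, List, Tuple

def rank_chunk_sizes(scores: Dict[int, Tuple[float, float]]) -> List[Tuple[int, float]]:
    """Sort chunk sizes by the mean of (faithfulness, relevancy), best first."""
    return sorted(
        ((size, (faith + rel) / 2) for size, (faith, rel) in scores.items()),
        key=lambda item: item[1],
        reverse=True,
    )

# (faithfulness, relevancy) per chunk size, taken from the output above
scores = {256: (1.0, 0.0), 128: (0.5, 0.0)}
for size, mean_score in rank_chunk_sizes(scores):
    print(f"chunk_size={size}: mean score {mean_score:.2f}")
# prints:
# chunk_size=256: mean score 0.50
# chunk_size=128: mean score 0.25
```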
