## 📊 **Evaluating Chunk Size Impact in RAG Pipelines** | **RAG100x**

This notebook introduces an automated system to **evaluate the quality of RAG pipelines** by analyzing how different **chunk sizes** affect performance — a key step for building **production-ready** RAG systems.

Unlike the previous three notebooks where we focused on **building RAG applications** from various data sources (PDFs, CSVs, web articles), here we take a step further by asking:  
**“How do we know if our RAG system is actually working well?”**

We use **LlamaIndex**, **OpenAI embeddings**, and structured **GPT-based evaluators** to:

- ✅ Benchmark RAG output across chunk sizes
- ✅ Score for faithfulness (hallucination detection)
- ✅ Measure relevance and answer quality
- ✅ Track retrieval + generation latency

This transition from building to **evaluating and optimizing** makes the system more **robust, testable, and production-aware**.

> 🛠️ All components are implemented inline for clarity, transparency, and customization.


### 📦 Installing Core Libraries
- **`llama-index`**  
  Provides the document loaders, text splitters, vector index, query engine, and LLM-based evaluators used throughout this notebook.

- **`openai`**  
  The official OpenAI Python client, used for the embedding model and for the GPT models that generate and grade answers.

- **`python-dotenv`**  
  Helps manage API credentials securely by loading them from a `.env` file into environment variables.

> We intentionally keep dependencies lightweight and modular to retain full control over the pipeline and ensure reproducibility in future experiments.


In [None]:
# Install required packages
!pip install llama-index openai python-dotenv

## 🧰 Getting Things Ready (Libraries + API Setup)

Before we start building anything, we need to load a few tools and make sure our environment is set up properly. Here's what each part does:

- **`nest_asyncio`**  
  This is a helper that allows certain parts of Python to work better inside Jupyter notebooks or Google Colab. Some tools (like LLMs or web loaders) use background tasks called *"async"*, and this makes sure they don't crash or conflict in a notebook environment.

- **`dotenv` and `os`**  
  Instead of typing your OpenAI API key directly in the notebook (which is risky), we keep it in a hidden `.env` file. These two libraries help us *load that key securely* so we can access OpenAI’s models safely.

- **`llama_index`**  
  This is a powerful tool that helps connect documents (like blog posts) with large language models (LLMs). We use it to:
  - Read and organize the text from files  
  - Break long text into smaller pieces (chunks)  
  - Build a searchable index  
  - Generate questions and evaluate the answers  
  - Customize how the model should respond using special prompts  

- **`openai`**  
  This library talks directly to OpenAI’s models like GPT-3.5 and GPT-4. We use it behind the scenes for generating answers and checking how accurate or relevant they are.

- **`load_dotenv()` + `openai.api_key = os.getenv(...)`**  
  These lines *load your secret API key* from the `.env` file and make it available to the notebook. Without this, none of the OpenAI features would work.

> 🟢 In short: This block makes sure everything is set up correctly — the tools are imported, your OpenAI key is safely loaded, and the notebook is ready to run RAG workflows with LlamaIndex and OpenAI.
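
One thing to watch: if the key is missing, `os.getenv` quietly returns `None`, and the problem only shows up at the first API call. Here is a minimal sketch of an early check, assuming a hypothetical helper called `require_env` (our own name, not part of `python-dotenv` or `openai`):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both "not set" and "set to an empty string"
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in the shell."
        )
    return value


# Possible usage after load_dotenv() has run (hypothetical, not in the original cell):
# openai.api_key = require_env("OPENAI_API_KEY")
```

Failing at setup time keeps a configuration mistake from looking like a pipeline bug later on.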


In [None]:
import nest_asyncio
import random

nest_asyncio.apply()
from dotenv import load_dotenv

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.prompts import PromptTemplate

from llama_index.core.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator
)
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

import openai
import time
import os
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

## 📂 Load the Source Documents

- **What this does**  
  This step loads all the source documents that our RAG system will use. These documents are stored in a `data/` folder and could include text, JSON, or other readable formats.

- **`data_dir = "../data"`**  
  This specifies the directory containing the documents we want to load. In this case, it's a folder named `data` located one level above the notebook.

- **`SimpleDirectoryReader(data_dir).load_data()`**  
  A utility from **LlamaIndex** that reads every supported file in the specified directory and returns the contents as a list of `Document` objects. This makes it easy to bring in external knowledge without manual parsing.

> 📎 This is the entry point for your knowledge base. Once loaded, these documents can be embedded, indexed, and retrieved during question answering.


In [None]:
data_dir = "../data"
documents = SimpleDirectoryReader(data_dir).load_data()

## 🧪 Generate Evaluation Questions Automatically

- **What this does**  
  Instead of writing evaluation questions manually, we use LlamaIndex's built-in `DatasetGenerator` to auto-generate realistic questions from our documents. This is useful for benchmarking how well the RAG pipeline performs across different queries.

- **`eval_documents = documents[0:20]`**  
  Selects the first 20 documents from our dataset for generating questions. Using a subset keeps evaluation lightweight while still being meaningful.

- **`DatasetGenerator.from_documents(...)`**  
  This initializes a generator that scans the given documents and prepares to create evaluation-ready questions.

- **`generate_questions_from_nodes()`**  
  Actually generates question prompts based on the selected documents' content. These questions simulate real user queries grounded in the source material.

- **`random.sample(..., num_eval_questions)`**  
  From the full list of generated questions, we randomly pick a specified number (`num_eval_questions = 25`) to create a diverse and manageable evaluation set.

> 🎯 This step helps us create a synthetic QA benchmark for evaluating our RAG system’s accuracy and grounding — a critical practice before deploying any real-world RAG.
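
Two details of `random.sample` are worth knowing: it raises a `ValueError` when asked for more items than the list holds, and it picks a different subset on every run. Here is a small sketch of a guarded, repeatable variant, assuming a hypothetical helper `sample_questions` and a `seed` parameter that are not part of the notebook's code:

```python
import random


def sample_questions(questions, k, seed=None):
    """Pick up to k distinct questions; a fixed seed makes the pick repeatable."""
    rng = random.Random(seed)        # private generator, leaves global random state alone
    k = min(k, len(questions))       # never ask for more items than exist
    return rng.sample(list(questions), k)
```

Fixing the seed means two chunk sizes are compared on the same question set, which makes their scores easier to compare.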


In [None]:
num_eval_questions = 25

eval_documents = documents[0:20]
data_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = data_generator.generate_questions_from_nodes()
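# Sample without replacement; cap the sample size in case fewer questions were generated
k_eval_questions = random.sample(eval_questions, min(num_eval_questions, len(eval_questions)))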

## 📏 Define Evaluation Metrics using GPT-4

To ensure our RAG system is not only retrieving relevant information but also generating factually grounded responses, we use two evaluation metrics:

### ✅ 1. **Faithfulness Evaluation**
- Checks **if the generated answer is directly supported by the retrieved context**.
- Based on GPT-4’s reasoning, it gives a **binary judgment**: `YES` (faithful) or `NO` (hallucinated).
- We define a **custom prompt template** with clear examples to guide the LLM. This ensures more **robust and interpretable evaluations** compared to a generic instruction.

> 🎯 Why customize the faithfulness prompt?  
Default evaluators may be too vague. By explicitly stating how GPT-4 should evaluate grounding and providing examples (e.g., about apple pie or Paris), we make evaluations **more consistent and reliable**.

---

### 🔍 2. **Relevancy Evaluation**
- Judges whether the generated answer and its retrieved context are **relevant to the input query**.
- Also uses `gpt-4o` as the judge and returns a pass/fail verdict for each question.
- Helps surface **retrieval mismatch**: cases where the retrieved chunks do not actually address the question.

---

### ⚙️ Model & Settings
- We configure all evaluations to use **`gpt-4o`**, a faster and more cost-efficient variant of GPT-4 with strong reasoning skills.
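- This model is assigned to LlamaIndex’s global `Settings.llm`, which evaluators use by default when no model is passed to them. Evaluators capture their model when they are created, so the later switch of `Settings.llm` to the answer-generation model inside the benchmark function does not change the judge.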

> 🛠️ These evaluators help **quantify hallucinations and retrieval mismatch**, making the RAG system more trustworthy and production-ready.
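
To make the mechanics concrete, here is a rough sketch of what a YES/NO judge prompt involves: fill the `{query_str}` and `{context_str}` placeholders, send the text to the judge model, then map the reply to pass or fail. The template text below is shortened, and `build_prompt` and `parse_verdict` are hypothetical names. LlamaIndex handles these steps internally, and its parsing may differ.

```python
# Shortened stand-in for the full faithfulness prompt (illustration only)
FAITHFULNESS_TEMPLATE = (
    "Please tell if a given piece of information is directly supported by the context.\n"
    "Answer with either YES or NO.\n\n"
    "Information: {query_str}\n"
    "Context: {context_str}\n"
    "Answer: "
)


def build_prompt(information: str, context: str) -> str:
    """Fill the two placeholders the evaluator prompt expects."""
    return FAITHFULNESS_TEMPLATE.format(query_str=information, context_str=context)


def parse_verdict(raw_reply: str) -> bool:
    """Map a free-text judge reply to a pass/fail flag (True means faithful)."""
    return raw_reply.strip().lower().startswith("yes")
```

Keeping the verdict binary is what lets us average it into a single score per chunk size later.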


In [None]:
# We will use GPT-4 for evaluating the responses
gpt4 = OpenAI(temperature=0, model="gpt-4o")

# Set appropriate settings for the LLM
Settings.llm = gpt4

# Define the faithfulness evaluator and bind it to the judge model explicitly,
# so later changes to Settings.llm cannot affect it
faithfulness_gpt4 = FaithfulnessEvaluator(llm=gpt4)

faithfulness_new_prompt_template = PromptTemplate(""" Please tell if a given piece of information is directly supported by the context.
    You need to answer with either YES or NO.
    Answer YES if any part of the context explicitly supports the information, even if most of the context is unrelated. If the context does not explicitly support the information, answer NO. Some examples are provided below.

    Information: Apple pie is generally double-crusted.
    Context: An apple pie is a fruit pie in which the principal filling ingredient is apples.
    Apple pie is often served with whipped cream, ice cream ('apple pie à la mode'), custard, or cheddar cheese.
    It is generally double-crusted, with pastry both above and below the filling; the upper crust may be solid or latticed (woven of crosswise strips).
    Answer: YES

    Information: Apple pies taste bad.
    Context: An apple pie is a fruit pie in which the principal filling ingredient is apples.
    Apple pie is often served with whipped cream, ice cream ('apple pie à la mode'), custard, or cheddar cheese.
    It is generally double-crusted, with pastry both above and below the filling; the upper crust may be solid or latticed (woven of crosswise strips).
    Answer: NO

    Information: Paris is the capital of France.
    Context: This document describes a day trip in Paris. You will visit famous landmarks like the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.
    Answer: NO

    Information: {query_str}
    Context: {context_str}
    Answer:

    """)

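# Replace the evaluator's default evaluation prompt with the custom template.
# "eval_template" is the key FaithfulnessEvaluator lists in get_prompts();
# a key it does not recognize leaves the default prompt in place.
faithfulness_gpt4.update_prompts({"eval_template": faithfulness_new_prompt_template})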

# Define the relevancy evaluator, also bound explicitly to the judge model
relevancy_gpt4 = RelevancyEvaluator(llm=gpt4)

## 🧪 Evaluate Chunk Size Impact on RAG Accuracy & Speed

To assess how **different chunk sizes** affect both **retrieval quality** and **response latency**, we define a custom evaluation function.

### 🔧 Function: `evaluate_response_time_and_accuracy(chunk_size, eval_questions)`

This function:
- Runs a RAG pipeline using the specified `chunk_size`.
- Uses **GPT-3.5-Turbo** for generating answers (cost-efficient and fast).
- Uses **GPT-4** to evaluate each answer for:
  - **Faithfulness** (is it grounded in the retrieved context?)
  - **Relevancy** (is the context relevant to the query?)
- Measures the **response time** per query.

### 📊 What it Measures:
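- **⏱️ Average Response Time**: Mean wall-clock time per query, covering retrieval and answer generation (the evaluation calls are not timed).
- **✅ Faithfulness Score**: Fraction of answers (0 to 1) that the judge marks as supported by the retrieved context.
- **🔍 Relevancy Score**: Fraction of answers (0 to 1) whose response and retrieved context the judge marks as relevant to the query.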

> 💡 Why this matters:  
In real-world RAG systems, there's always a **tradeoff between chunk size and performance**:
- Large chunks = richer context but slower and harder to retrieve precisely.
- Small chunks = faster and more targeted, but may lack enough info.

This function helps us **quantify that tradeoff** and choose optimal chunk sizes for production use.
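
To see how chunk size and overlap interact, here is a simplified sliding-window splitter. It is only an illustration under stated assumptions: it counts whitespace-separated words, whereas LlamaIndex's splitter works on tokens and tries to respect sentence boundaries. `chunk_words` is a hypothetical helper, not a LlamaIndex function.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size words that overlap by chunk_overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks
```

With 10 words, `chunk_size=5`, and `chunk_overlap=1`, this produces three chunks, and each chunk begins with the last word of the previous one. Smaller sizes mean more chunks to embed and search, and each one carries less context.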

---

📌 *Note:*  
While `BatchEvalRunner` from LlamaIndex can evaluate faster in bulk, we use a **for-loop** here to individually measure **response time** for each question, which is crucial for latency benchmarking.
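
The per-question timing idea can be sketched independently of any LLM call. The helpers below are hypothetical (`timed_call`, `average_latency`) and use `time.perf_counter`, a monotonic clock suited to measuring short intervals. Here `fn` stands in for something like `query_engine.query`.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def average_latency(fn, inputs):
    """Average wall-clock seconds per call of fn over the given inputs."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for item in inputs:
        _, elapsed = timed_call(fn, item)
        total += elapsed
    return total / len(inputs)
```

Timing only the query call keeps evaluator latency out of the number, so the figure reflects what a user of the RAG system would wait for.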


In [None]:
# Compute average response time, faithfulness, and relevancy for a given chunk size.
# GPT-3.5-Turbo generates the answers; the GPT-4o evaluators defined above grade them.
def evaluate_response_time_and_accuracy(chunk_size, eval_questions):
    """
    Evaluate the average response time, faithfulness, and relevancy of responses
    generated by GPT-3.5-turbo for a given chunk size.

    Parameters:
    chunk_size (int): Chunk size (in tokens) used when splitting documents for the index.
    eval_questions (list[str]): Questions to run through the query engine.

    Returns:
    tuple: (average response time in seconds, fraction of faithful answers,
            fraction of relevant answers).
    """
    num_questions = len(eval_questions)
    if num_questions == 0:
        raise ValueError("eval_questions must not be empty")

    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    # Answer-generation model and chunking settings.
    # Note: this changes the global Settings for the rest of the notebook.
    llm = OpenAI(model="gpt-3.5-turbo")

    Settings.llm = llm
    Settings.chunk_size = chunk_size
    Settings.chunk_overlap = chunk_size // 5  # 20% overlap between neighbouring chunks

    # Build the vector index from the eval_documents defined earlier in the notebook
    vector_index = VectorStoreIndex.from_documents(eval_documents)

    # Build the query engine; retrieve the 5 most similar chunks per question
    query_engine = vector_index.as_query_engine(similarity_top_k=5)

    # Iterate over each question in eval_questions to compute metrics.
    # While BatchEvalRunner can be used for faster evaluations (see: https://docs.llamaindex.ai/en/latest/examples/evaluation/batch_eval.html),
    # we're using a loop here to specifically measure response time for different chunk sizes.
    for question in eval_questions:
        # Time only retrieval + generation, not the evaluation calls below
        start_time = time.perf_counter()
        response_vector = query_engine.query(question)
        elapsed_time = time.perf_counter() - start_time

        # .passing is a boolean; summing booleans counts the passes
        faithfulness_result = faithfulness_gpt4.evaluate_response(
            response=response_vector
        ).passing

        relevancy_result = relevancy_gpt4.evaluate_response(
            query=question, response=response_vector
        ).passing

        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result

    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions

    return average_response_time, average_faithfulness, average_relevancy

## 🧪 Benchmarking Different Chunk Sizes

In this step, we test how different **chunk sizes** impact:

- 📉 **Latency** (response time)
- ✅ **Faithfulness** (factual correctness)
- 🔍 **Relevancy** (how on-topic the retrieved context is)

### 🧪 Chunk Sizes Tested:
We loop over a list of chunk sizes:

In [None]:

chunk_sizes = [128, 256]

for chunk_size in chunk_sizes:
    avg_response_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy(
        chunk_size, k_eval_questions
    )
    print(
        f"Chunk size {chunk_size} - Average Response time: {avg_response_time:.2f}s, "
        f"Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}"
    )

---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdowns throughout the notebook — explaining *why* each package was installed, *why* each line of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.

## 📊 Why Evaluate Chunk Sizes in RAG?

Chunking is one of the **most underrated performance levers** in Retrieval-Augmented Generation (RAG). The size of the text chunks directly impacts:

- 🧠 **Context relevance** — Too small, and context gets fragmented. Too large, and noise overwhelms signal.
- ⚡ **Inference latency** — Smaller chunks are faster to retrieve and rank. Larger chunks can slow down LLM processing.
- 🎯 **Accuracy** — Faithfulness and relevancy both fluctuate depending on how well the chunks align with the query intent.

In this notebook, I:
- Loaded a document corpus and generated **25 evaluation questions**
- Used **GPT-3.5** to generate answers and **GPT-4** to grade them
- Evaluated how response **latency, faithfulness, and relevancy** vary across different chunk sizes (128, 256)
- Built a **custom evaluator function** to measure all 3 metrics for any chunk size

This approach gives **quantitative evidence** to guide chunk tuning in real-world RAG systems.

---

## 🧠 What’s New in This Version?

Compared to my previous RAG builds, this version focuses heavily on **automated evaluation pipelines** using modern LLM tools:

- 📐 **Metric-driven architecture tuning** — Measures real-world tradeoffs using structured prompts and eval loops  
- 🧪 **Faithfulness and Relevancy grading with GPT-4** — Ensures model answers are grounded in retrieved chunks  
- 🔁 **Iterative chunking experiments** — Runs the same evaluation logic over multiple chunk sizes for easy comparison  
- 🔍 **Custom faithfulness prompt** — Replaces the default LlamaIndex template with a stricter, context-sensitive scoring rubric  
- 📦 **All logic in-notebook** — No reliance on external helper modules — everything is transparent and reproducible

This project brings me one step closer to building **production-grade, evaluation-first RAG systems**.

---

## 🚀 What Could Be Added Next?

Here are some high-impact ideas to build on this foundation:

- 🧩 **More granular chunking strategies** — Try sentence-based splits, semantic splitting, or recursive chunking with windowed overlap  
  *Go beyond static sizes — explore dynamic methods like sentence boundaries or chunk re-ranking.*

- 🔁 **Multiple eval rounds per chunk size** — Average metrics over several random samples for higher statistical confidence  
  *Improves robustness of conclusions — less noise from specific question sets.*

- 📈 **Visualize results** — Plot latency vs. accuracy to find the ideal operating point  
  *Use matplotlib or seaborn to produce tradeoff curves and identify sweet spots.*

- 🧠 **Expand eval questions** — Use LLMs to generate multi-hop, open-ended, and factoid queries  
  *Stress-test the retriever + generator stack with more diverse question types.*

- ✅ **Integrate LLM-as-a-Judge frameworks**: Use OpenAI function-calling or LangChain’s evaluation utilities for cleaner eval orchestration  
  *This would standardize evaluation across multiple projects.*

- 🖼️ **Deploy with UI** — Add a Streamlit dashboard to control chunk size, ask queries, and view live metrics  
  *Useful for showcasing internal benchmarks or making decisions interactively.*
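
The multiple-rounds idea could look roughly like this: run the benchmark several times on different question samples, then report the mean and spread of each metric. `summarize_rounds` is a hypothetical helper, and the scores it receives would come from repeated calls to `evaluate_response_time_and_accuracy`.

```python
from statistics import mean, stdev


def summarize_rounds(scores):
    """Return (mean, sample standard deviation) for one metric across eval rounds."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    # stdev needs at least two data points; report zero spread for a single round
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread
```

A large spread relative to the gap between two chunk sizes suggests the difference may be noise rather than a real effect.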

This notebook sets the stage for rigorous, benchmark-driven RAG optimization — ideal for research and production readiness alike.

## 💡 Final Word

This notebook is part of my larger personal project: **RAG100x**, a challenge to build and log my journey in RAG from 0 to 100 in the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.