## Exploring the ideal chunk size for a RAG system using LlamaIndex and evaluation

#### Introduction:

Finding a chunk size that works for your application is one of the crucial first steps in building a RAG application. Chunk size determines whether the LLM receives all the context it needs or is handed noise that degrades response quality. Hence, striking the right balance here is vital.

Here, using the evaluation tools in LlamaIndex, we compare different chunk sizes on Faithfulness, Relevancy, and response time to determine the optimal chunk size for a given application.

Retrieval-augmented generation (RAG) fuses the extensive retrieval capabilities of search systems with an LLM. When implementing a RAG system, one critical parameter that governs the system’s efficiency and performance is the chunk_size. How does one discern the optimal chunk size for seamless retrieval? This is where LlamaIndex Response Evaluation comes in handy. In this blog post, we'll guide you through the steps to determine the best chunk size using LlamaIndex’s Response Evaluation module. If you're unfamiliar with the Response Evaluation module, we recommend reviewing its documentation before proceeding.


#### Why Chunk Size Matters
Choosing the right chunk_size is a critical decision that can influence the efficiency and accuracy of a RAG system in several ways:

**Relevance and Granularity:** A small chunk_size, like 128, yields more granular chunks. This granularity, however, presents a risk: vital information might not be among the top retrieved chunks, especially if the similarity_top_k setting is as restrictive as 2. Conversely, a chunk size of 512 is likely to encompass all necessary information within the top chunks, ensuring that answers to queries are readily available. To navigate this, we employ the Faithfulness and Relevancy metrics. These measure the absence of ‘hallucinations’ and the ‘relevancy’ of responses based on the query and the retrieved contexts respectively.
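
To make the granularity trade-off concrete, here is a minimal sketch in plain Python. It assumes a hypothetical 4,096-token document and a naive fixed-size splitter (`split_into_chunks` is an illustrative helper, not a LlamaIndex function), and it ignores chunk overlap and sentence boundaries. It only shows how little of a document two retrieved chunks can cover when the chunks are small.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Hypothetical document: 4,096 placeholder tokens standing in for real text.
tokens = [f"tok{i}" for i in range(4096)]
similarity_top_k = 2  # the restrictive setting mentioned above

coverage = {}
for chunk_size in (128, 512):
    chunks = split_into_chunks(tokens, chunk_size)
    # Best case: every retrieved chunk is full and distinct.
    retrieved_tokens = sum(len(chunk) for chunk in chunks[:similarity_top_k])
    coverage[chunk_size] = retrieved_tokens / len(tokens)
    print(
        f"chunk_size={chunk_size}: {len(chunks)} chunks, "
        f"top-{similarity_top_k} covers {coverage[chunk_size]:.2%} of the document"
    )
```

With the same `similarity_top_k`, the larger chunk size hands the LLM four times as much of the document, which is why an answer is more likely to be present in the retrieved context.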

**Response Generation Time:** As the chunk_size increases, so does the volume of information directed into the LLM to generate an answer. While this can ensure a more comprehensive context, it might also slow down the system. Ensuring that the added depth doesn't compromise the system's responsiveness is crucial.
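
As a rough illustration of why larger chunks can slow generation, the sketch below computes an upper bound on the retrieved context passed to the LLM for each chunk size tested in this notebook, using `similarity_top_k=5` as in the query engine further down. `max_context_tokens` is a hypothetical helper, and the bound ignores the prompt template, the question itself, and partially filled chunks.

```python
def max_context_tokens(chunk_size: int, similarity_top_k: int) -> int:
    """Upper bound on retrieved context size: top_k chunks, each at most chunk_size tokens."""
    return chunk_size * similarity_top_k


similarity_top_k = 5
context_budget = {
    size: max_context_tokens(size, similarity_top_k)
    for size in (128, 256, 512, 1024, 2048)
}

for size, budget in context_budget.items():
    print(f"chunk_size={size}: at most {budget} context tokens per query")
```

The bound grows linearly with chunk size, so going from 128 to 2048 multiplies the worst-case context by 16.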

In essence, determining the optimal chunk_size is about striking a balance: capturing all essential information without sacrificing speed. It's vital to undergo thorough testing with various sizes to find a configuration that suits the specific use-case and dataset.

### Setup

Before starting the chunk size experiment, let's install the required libraries from requirements.txt:

```bash
!pip install -r requirements.txt
```

### Importing the libraries

In [None]:
import nest_asyncio

nest_asyncio.apply()

from llama_index.core import (SimpleDirectoryReader,
                                VectorStoreIndex,
                                Settings, 
                                Document)

from llama_index.core.evaluation import (FaithfulnessEvaluator,
                                    DatasetGenerator,
                                    RelevancyEvaluator)

from llama_index.core.node_parser import SentenceSplitter

from llama_index.embeddings.openai import OpenAIEmbedding 
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding

from llama_index.llms.openai import OpenAI

from llama_index.llms.google_genai import GoogleGenAI

import openai
import time
from dotenv import load_dotenv
import os


load_dotenv()

if os.getenv("OPENAI_API_KEY"):
    print("Key present")


### Data Preparation

Let's download the data from the Uber 10K SEC Filings for 2021 for this experiment.

In [None]:
import os
os.makedirs("data/10k", exist_ok=True)
!curl -L https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/uber_2021.pdf -o data/10k/uber_2021.pdf

### Load Data

In [None]:
loader = SimpleDirectoryReader("./data/10k/")
documents = loader.load_data()

In [None]:
# Let's check the number of documents loaded
print(f"Total number of documents loaded: {len(documents)}")

### Generate Questions for Evaluation

To select the right chunk_size, we'll compute metrics like Average Response time, Faithfulness, and Relevancy for various chunk_sizes. The DatasetGenerator will help us generate questions from the documents.

In [None]:
#Let's generate questions to use in evaluation
eval_documents = documents[:20]
ques_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = ques_generator.generate_questions_from_nodes(num=40)

In [None]:
#Let's preview some sample questions generated
print(f"Total number of questions generated {len(eval_questions)}  - {type(eval_questions)}")
print("Sample Preview of Generated Evaluation Questions:")
for i, question in enumerate(eval_questions[:10],1):
    print(f"{i}. {question}")

### Setting up Evaluation:

To evaluate the RAG system we use two evaluation metrics, **Faithfulness** and **Relevancy**. The code below uses **gpt-4o-mini** as the evaluator, and the same model also generates the responses being evaluated.

**Faithfulness** - Measures whether the response is hallucinated or is grounded in the context (the retrieved source nodes) provided to the model.

**Relevancy** - Measures whether the response addresses the actual query, that is, whether the response together with its source_nodes is a match for the query.
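
To our understanding, both evaluators give a pass/fail verdict per response, exposed as a score of 1.0 or 0.0, so an average over many questions reads as a pass rate. The sketch below illustrates that aggregation with a hypothetical `Verdict` class (a stand-in, not a LlamaIndex type) and made-up verdicts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    """Hypothetical stand-in for one evaluator result (not a LlamaIndex class)."""

    passing: bool

    @property
    def score(self) -> float:
        # Binary score: 1.0 for a pass, 0.0 for a fail.
        return 1.0 if self.passing else 0.0


def pass_rate(verdicts: List[Verdict]) -> float:
    """Average the binary scores; the result is the fraction of passing responses."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(v.score for v in verdicts) / len(verdicts)


# Made-up verdicts for illustration only.
made_up = [Verdict(True), Verdict(True), Verdict(False), Verdict(True)]
rate = pass_rate(made_up)
print(f"pass rate: {rate:.2f}")
```

An average faithfulness of 0.75 would therefore mean that three out of four responses were judged to be supported by their retrieved context.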

In [None]:
#Let's initialize the model and evaluation metrics

Settings.llm = OpenAI(model = 'gpt-4o-mini', temperature=0.2)

faithfulness = FaithfulnessEvaluator(llm=Settings.llm)

relevancy = RelevancyEvaluator(llm=Settings.llm)


### Function to Evaluate a Chunk Size

In [1]:
def evaluate_response_time_accuracy(chunk_size: int, eval_queries: list) -> tuple:
    """Evaluate the average response time, faithfulness and relevancy for one chunk size.

    Both the responses and the evaluations come from gpt-4o-mini.

    Input Parameters:
    chunk_size : int - The size of the data chunks
    eval_queries : list - The questions to run against the index

    Output Parameter:
    Tuple - average response time, faithfulness and relevancy
    """
    total_time = 0
    total_faithfulness = 0
    total_relevancy = 0

    total_questions = len(eval_queries)

    print(f"Total number of queries to evaluate: {total_questions}")

    # LLM that answers the queries; the evaluators use Settings.llm
    res_llm = OpenAI(model='gpt-4o-mini', temperature=0.2)
    chunk_overlap = int(chunk_size * 0.2)  # 20% overlap between adjacent chunks

    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    Settings.text_splitter = splitter

    Settings.embed_model = OpenAIEmbedding(model='text-embedding-3-small', dimensions=1536)

    # from_documents runs the splitter; the bare constructor expects pre-split nodes,
    # so passing documents to it directly would skip chunking
    index = VectorStoreIndex.from_documents(
        eval_documents, transformations=[splitter], show_progress=True
    )

    engine = index.as_query_engine(llm=res_llm, similarity_top_k=5, response_mode='compact')

    for question in eval_queries:

        time.sleep(3)  # pause between API calls (e.g. to avoid rate limits)

        # Time only the query itself, not the evaluation calls
        start_time = time.time()

        response = engine.query(question)

        elapsed_time = time.time() - start_time

        time.sleep(3)

        faithfulness_result = faithfulness.evaluate_response(response=response)

        time.sleep(3)

        relevancy_result = relevancy.evaluate_response(query=question, response=response)

        total_time += elapsed_time
        total_faithfulness += faithfulness_result.score
        total_relevancy += relevancy_result.score

    avg_response_time = total_time / total_questions
    avg_faithfulness = total_faithfulness / total_questions
    avg_relevancy = total_relevancy / total_questions

    return avg_response_time, avg_faithfulness, avg_relevancy

### Testing Across Different Chunk Sizes
We'll evaluate a range of chunk sizes [128, 256, 512, 1024, 2048] to identify which offers the most promising metrics.

In [None]:
chunk_sizes = [128, 256, 512, 1024, 2048]
eval_queries = eval_questions

print(eval_queries)


In [None]:

for chunk_size in chunk_sizes:
    avg_response_time, avg_faithfulness, avg_relevancy = evaluate_response_time_accuracy(chunk_size, eval_queries)
    print(f"Metrics for chunk size {chunk_size}: average response time {avg_response_time}, average faithfulness {avg_faithfulness}, average relevancy {avg_relevancy}")
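
Once the metrics are available, one possible way to choose among chunk sizes is to prefer the highest combined faithfulness and relevancy and break ties with the lower response time. The sketch below assumes that selection rule (it is one option, not the only reasonable one), uses a hypothetical helper `pick_chunk_size`, and runs on placeholder numbers that are not measurements from this notebook.

```python
from typing import Dict, Tuple

# PLACEHOLDER values for illustration only, not results from this experiment.
# chunk_size -> (avg_response_time_seconds, avg_faithfulness, avg_relevancy)
placeholder_results: Dict[int, Tuple[float, float, float]] = {
    128: (1.2, 0.80, 0.75),
    512: (1.5, 0.90, 0.90),
    1024: (2.0, 0.90, 0.90),
}


def pick_chunk_size(results: Dict[int, Tuple[float, float, float]]) -> int:
    """Return the chunk size with the best quality, preferring faster responses on ties."""

    def ranking_key(size: int) -> Tuple[float, float]:
        response_time, faith, relev = results[size]
        # Higher quality sorts first; negating the time makes lower times win ties.
        return (faith + relev, -response_time)

    return max(results, key=ranking_key)


best = pick_chunk_size(placeholder_results)
print(f"Selected chunk size: {best}")
```

In practice the weighting between quality and latency depends on the application, so the ranking key is the part most worth adapting.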

### Debugging step: list the available Google GenAI and OpenAI models and their names

In [None]:
from google import genai

gemi_key = os.getenv("GOOGLE_API_KEY")

client = genai.Client(api_key=gemi_key)

models = client.models.list()

for model in models:
    print(model.name)

In [57]:
import openai

openai_key = os.getenv("OPENAI_API_KEY")

client=openai.Client(api_key=openai_key)

models=client.models.list()

for model in models:
    print(model.id)

2026-02-26 16:06:24,709 - INFO - HTTP Request: GET https://api.openai.com/v1/models "HTTP/1.1 200 OK"


gpt-4-0613
gpt-4
gpt-3.5-turbo
gpt-4o-search-preview-2025-03-11
gpt-5.3-codex
gpt-realtime-1.5
gpt-audio-1.5
gpt-4o-search-preview
davinci-002
babbage-002
gpt-3.5-turbo-instruct
gpt-3.5-turbo-instruct-0914
dall-e-3
dall-e-2
gpt-4-1106-preview
gpt-3.5-turbo-1106
tts-1-hd
tts-1-1106
tts-1-hd-1106
text-embedding-3-small
text-embedding-3-large
gpt-4-0125-preview
gpt-4-turbo-preview
gpt-3.5-turbo-0125
gpt-4-turbo
gpt-4-turbo-2024-04-09
gpt-4o
gpt-4o-2024-05-13
gpt-4o-mini-2024-07-18
gpt-4o-mini
gpt-4o-2024-08-06
gpt-4o-audio-preview
gpt-4o-realtime-preview
omni-moderation-latest
omni-moderation-2024-09-26
gpt-4o-realtime-preview-2024-12-17
gpt-4o-audio-preview-2024-12-17
gpt-4o-mini-realtime-preview-2024-12-17
gpt-4o-mini-audio-preview-2024-12-17
o1-2024-12-17
o1
gpt-4o-mini-realtime-preview
gpt-4o-mini-audio-preview
o3-mini
o3-mini-2025-01-31
gpt-4o-2024-11-20
gpt-4o-mini-search-preview-2025-03-11
gpt-4o-mini-search-preview
gpt-4o-transcribe
gpt-4o-mini-transcribe
o1-pro-2025-03-19
o1-pro
