# Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex

## Introduction

- Retrieval-augmented generation (RAG) combines the retrieval capabilities of a search system with the generative ability of a large language model (LLM). 
- When implementing a RAG system, one critical parameter that governs the system’s efficiency and performance is the chunk_size. 

## How does one discern the optimal chunk size for seamless retrieval? 

- This is where LlamaIndex’s Response Evaluation module comes in handy. 

This notebook walks through the steps to determine a suitable chunk size using that module. 

In [2]:
# Requirements.txt file:

# langchain-community
# langchain-core
# nest-asyncio
# llama_index
# langchain
# llama-index-llms-openai
# llama-index-embeddings-openai
# ollama
# llama-index-llms-langchain
# llama-index-embeddings-langchain
# spacy

## Why Chunk Size Matters

Choosing the right chunk_size is a critical decision that can influence the efficiency and accuracy of a RAG system in several ways:

- Relevance and Granularity: 

    - A small chunk_size, like 128, yields more granular chunks. Risk: Vital information might not be among the top retrieved chunks, especially if the similarity_top_k setting is as restrictive as 2. 

    - A chunk size of 512 is likely to encompass all necessary information within the top chunks, ensuring that answers to queries are readily available.

    ##### To navigate this, we employ the Faithfulness and Relevancy metrics. 

    - Faithfulness measures the absence of 'hallucinations', i.e. whether the response is supported by the retrieved contexts. Relevancy measures whether the response and the retrieved contexts actually address the query.

- Response Generation Time: 

    - As the chunk_size increases, so does the volume of information directed into the LLM to generate an answer. 
    - While this can ensure a more comprehensive context, it might also slow down the system. 
    - Ensuring that the added depth doesn't compromise the system's responsiveness is crucial.

    ##### Determining the optimal chunk_size is about striking a balance: capturing all essential information without sacrificing speed. 

It's vital to undergo thorough testing with various sizes to find a configuration that suits the specific use case and dataset.
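The trade-off above can be made concrete with a toy sketch. It assumes whitespace-separated words stand in for tokens, whereas LlamaIndex's real splitters count model tokens and respect sentence boundaries. `split_into_chunks` and `context_budget` are hypothetical helpers written for this illustration, not LlamaIndex APIs.

```python
def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[list[str]]:
    """Split text into word-level chunks of at most chunk_size words.

    Words are a stand-in for tokens here (an assumption for illustration).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def context_budget(chunk_size: int, similarity_top_k: int) -> int:
    """Upper bound on the retrieved words passed to the LLM per query."""
    return chunk_size * similarity_top_k


if __name__ == "__main__":
    # Synthetic 2000-word document, used only to show the scaling.
    text = " ".join(f"w{i}" for i in range(2000))
    for size in (128, 256, 512):
        n_chunks = len(split_into_chunks(text, size))
        print(f"chunk_size={size}: {n_chunks} chunks, "
              f"up to {context_budget(size, 2)} words of context at top_k=2")
```

Smaller chunks produce more candidates for the retriever to choose from, while larger chunks raise the amount of text sent to the LLM for the same similarity_top_k.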

## Importing the necessary libraries

In [3]:
import time

import nest_asyncio
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from llama_index.core import (
    ServiceContext,
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    set_global_service_context,
)
from llama_index.core.evaluation import DatasetGenerator, FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

# Allow nested event loops so that async LlamaIndex calls work inside Jupyter.
nest_asyncio.apply()



## Load the Data

In [4]:
# Load Data
# Use a forward slash: "\s" in a normal string literal is an invalid escape sequence.
reader = SimpleDirectoryReader("documents/short_story")
documents = reader.load_data()
documents

[Document(id_='2a4682bf-941e-4895-977e-6fc1cfe7f0d8', embedding=None, metadata={'page_label': '1', 'file_name': 'Vanka.pdf', 'file_path': 'c:\\Project_Files\\Langchain\\documents\\short_story\\Vanka.pdf', 'file_type': 'application/pdf', 'file_size': 195740, 'creation_date': '2024-04-10', 'last_modified_date': '2024-04-10'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text=' \n \n \n \n \n \n \n \n \n \nVanka  \nBy Anton Chekhov   \n ', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='c3f34b7d-f0ab-4749-b449-1a65edc319e6', embedding=None, metadata={'page_label': '2', 'file_name': 'Vanka.pdf', 'file_path': 'c:\\Project_Files\\Langchain\\documents\

## Question Generation

To select the right chunk_size, we'll compute metrics like Average Response time, Faithfulness, and Relevancy for various chunk_sizes. 

The DatasetGenerator will help us generate questions from the documents.

In [5]:
# Initialize the Ollama "llama2" model, used to generate questions and to judge responses.
llm = Ollama(model="llama2")
# Initialize Ollama embeddings (no model is specified, so the library default is used).
embeddings = OllamaEmbeddings()
service_context_llama2 = service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embeddings, chunk_size=300
)



In [6]:
# Evaluation uses the first 4 pages to build the index; up to 20 questions are generated in a later cell.
# Note: the generator below is given all pages (documents), not only eval_documents.
eval_documents = documents[:4]

dataset_generator = DatasetGenerator.from_documents(documents, num_questions_per_chunk=2, show_progress=True, service_context=service_context_llama2)
# dataset_generator = RagDatasetGenerator.from_documents(documents,num_questions_per_chunk=2, show_progress=True)

dataset_generator

Parsing nodes: 100%|██████████| 7/7 [00:00<00:00, 522.07it/s]


<llama_index.core.evaluation.dataset_generation.DatasetGenerator at 0x1847b12a250>

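#### ISSUE:

Running ollama serve fails when the default port is already bound:

Error: listen tcp 127.0.0.1:11434: bind: Only one usage of each socket address (protocol/network address/port) is normally permitted.

If no server is reachable on that port, the notebook itself fails with:

ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at  0x000001108EBB9220>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))


#### Solution:

Option 1: start the server on a different port (Windows command prompt):

set OLLAMA_HOST=127.0.0.1:11433 

ollama serve

The notebook must then point at the same port; the LangChain Ollama and OllamaEmbeddings classes accept a base_url argument for this, e.g. base_url="http://127.0.0.1:11433".

Option 2: find the process that holds the port, stop it, and start the server again:

netstat -ano | findstr :<PORT>

taskkill /PID <PID> /F

or, if Node.js is installed:

npx kill-port <PORT>

#### Reason:

The most likely cause is that Ollama was installed as a background service on Windows and is already running. In that case there is no need to run ollama serve manually, and doing so fails because the port is already taken. Check the Windows services list to confirm.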
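Before starting or killing anything, it can help to check from Python whether something is already listening on the Ollama port. The sketch below uses only the standard library; `is_port_in_use` is a hypothetical helper name, and 11434 is the port that appears in the error messages above. A successful connection shows only that some process is listening, not that it is Ollama.

```python
import socket


def is_port_in_use(host: str = "127.0.0.1", port: int = 11434, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if is_port_in_use():
        print("Port 11434 is in use; a server (probably Ollama) is already running.")
    else:
        print("Nothing is listening on port 11434; start the server with `ollama serve`.")
```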


In [7]:
eval_questions = dataset_generator.generate_questions_from_nodes(num = 20)

# eval_questions = dataset_generator.generate_dataset_from_nodes()
# eval_questions = ["Who is the author?", "What is the main agenda of the document?", "What is the title?"]

eval_questions

100%|██████████| 20/20 [10:41<00:00, 32.08s/it]  


['Great! Based on the provided context information, here are two potential questions that could be used for a quiz or examination:',
 'What is the author\'s name of the short story "Vanka"?',
 '* Context clues: The passage mentions the author\'s name in the file name ("By Anton Chekhov"), and the creation date and last modified date are both on April 10, 2024.',
 'What is the file type of the PDF document?',
 '* Context clues: The passage mentions the file type as "application/pdf" and provides the file size, which suggests that it is a PDF document.',
 'Great! Based on the provided context information, here are two questions that I have generated for your upcoming quiz/examination:',
 'What is the name of the short story being read in the passage, and who is the author? (File Name: Vanka.pdf)',
 'When was the short story "Vanka" written, according to the passage? (Page Label: 2)',
 'Of course! Based on the context information provided, here are two potential questions that could be us

In [8]:
len(eval_questions)

20
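The generated list above mixes real questions with LLM preamble ("Great! Based on the provided context...") and commentary lines ("* Context clues: ..."). Evaluating on those entries adds noise, so one option is to filter them out first. The following is a minimal heuristic sketch, not part of LlamaIndex: `keep_real_questions` is a hypothetical helper that keeps entries containing a question mark and drops bullet-style commentary and duplicates.

```python
def keep_real_questions(candidates: list[str]) -> list[str]:
    """Keep entries that look like questions; drop preamble, commentary and duplicates."""
    questions = []
    seen = set()
    for text in candidates:
        cleaned = text.strip()
        # Commentary lines in the output above start with a bullet marker.
        if cleaned.startswith(("*", "-")):
            continue
        # Preamble lines such as "Great! ... examination:" contain no question mark.
        if "?" not in cleaned:
            continue
        if cleaned in seen:
            continue
        seen.add(cleaned)
        questions.append(cleaned)
    return questions


if __name__ == "__main__":
    raw = [
        "Great! Based on the provided context information, here are two potential questions:",
        'What is the author\'s name of the short story "Vanka"?',
        "* Context clues: The passage mentions the author's name in the file name.",
        "What is the file type of the PDF document?",
    ]
    for question in keep_real_questions(raw):
        print(question)
```

Filtering reduces the number of questions below 20, so the averages computed later would be taken over the filtered list.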

## Setting Up Evaluators

- The llama2 model serves as the judge that evaluates the responses generated during the experiment. 

Two evaluators, FaithfulnessEvaluator and RelevancyEvaluator, are initialised with the llama2 service context (service_context_llama2).

    - Faithfulness Evaluator: checks whether the response from a query engine is supported by the retrieved source nodes, i.e. whether it was hallucinated.
    
    - Relevancy Evaluator: checks whether the response, together with the source nodes, actually answers the query.

In [9]:
# Define Faithfulness and Relevancy Evaluators, which use the llama2 model as the judge
faithfulness_llama2 = FaithfulnessEvaluator(service_context=service_context_llama2)
relevancy_llama2 = RelevancyEvaluator(service_context=service_context_llama2)

## Response Evaluation For A Chunk Size

We evaluate each chunk_size based on 3 metrics.

    - Average Response Time.

    - Average Faithfulness.
    
    - Average Relevancy.

The function evaluate_response_time_and_accuracy does this in three steps:

    - VectorIndex Creation.

    - Building the Query Engine.
    
    - Metrics Calculation.
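The metrics calculation step can be sketched independently of any particular library. In the sketch below, `evaluate_pipeline` is a hypothetical helper that takes plain callables, so it runs without Ollama or LlamaIndex. In this notebook the callables would be `query_engine.query`, `lambda r: faithfulness_llama2.evaluate_response(response=r).passing` and `lambda q, r: relevancy_llama2.evaluate_response(query=q, response=r).passing`.

```python
import time
from typing import Callable, Iterable


def evaluate_pipeline(
    questions: Iterable[str],
    answer_fn: Callable[[str], object],
    faithfulness_fn: Callable[[object], bool],
    relevancy_fn: Callable[[str, object], bool],
) -> tuple[float, float, float]:
    """Return (average response time in seconds, faithfulness pass rate, relevancy pass rate)."""
    questions = list(questions)
    if not questions:
        raise ValueError("questions must not be empty")

    total_time = 0.0
    faithful = 0
    relevant = 0
    for question in questions:
        # Time only the answer generation, not the evaluation calls.
        start = time.perf_counter()
        response = answer_fn(question)
        total_time += time.perf_counter() - start

        # Each evaluator returns pass/fail, so the averages are pass rates.
        faithful += bool(faithfulness_fn(response))
        relevant += bool(relevancy_fn(question, response))

    n = len(questions)
    return total_time / n, faithful / n, relevant / n


if __name__ == "__main__":
    # Stub callables stand in for the query engine and the two evaluators.
    avg_time, avg_faith, avg_rel = evaluate_pipeline(
        ["a", "b", "c", "d"],
        answer_fn=str.upper,
        faithfulness_fn=lambda response: True,
        relevancy_fn=lambda question, response: question != "d",
    )
    print(f"time={avg_time:.6f}s faithfulness={avg_faith:.2f} relevancy={avg_rel:.2f}")
```

Passing callables keeps the timing and averaging logic testable with stubs before spending minutes on real LLM calls.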

In [10]:
# LLM used by the query engine to answer the questions (its responses are judged by the llama2 evaluators)
llm_test = Ollama(model="mistral")
    
# # Initialize Ollama embeddings.
# embeddings = OllamaEmbeddings()

# service_context = ServiceContext.from_defaults(llm=llm, chunk_size=chunk_size)
# vector_index = VectorStoreIndex.from_documents(eval_documents, service_context=service_context)

In [11]:
# # Define function to calculate average response time, average faithfulness and average relevancy metrics for given chunk size
# def evaluate_response_time_and_accuracy(chunk_size):
#     total_response_time = 0
#     total_faithfulness = 0
#     total_relevancy = 0
    
#     service_context = ServiceContext.from_defaults(llm=llm_test, embed_model=embeddings, chunk_size=chunk_size)
    
#     # Set the global service context for LLAMA components.
#     set_global_service_context(service_context)
    
#     # Parse nodes from the documents using the service context's node parser.
#     nodes = (service_context.node_parser.get_nodes_from_documents(documents))
    
#     # Initialize storage context with default settings.
#     storage_context = StorageContext.from_defaults()
    
#     # Add parsed documents (nodes) to the document store within the storage context.
#     storage_context.docstore.add_documents(nodes)
    
#     # Initialize vector store index from documents, storage context, and LLAMA model.
#     vector_index = VectorStoreIndex.from_documents(eval_documents,storage_context=storage_context,llm=llm_test)
    
#     query_engine = vector_index.as_query_engine(llm=llm_test)
#     num_questions = len(eval_questions)
    
#     for question in eval_questions:
#         start_time = time.time()
#         response_vector = query_engine.query(question)
#         elapsed_time = time.time() - start_time
#         faithfulness_result = faithfulness_llama2.evaluate_response(response=response_vector).passing
#         relevancy_result = relevancy_llama2.evaluate_response(query=question, response=response_vector).passing
        
#         total_response_time += elapsed_time
#         total_faithfulness += faithfulness_result
#         total_relevancy += relevancy_result
        
#     average_response_time = total_response_time / num_questions
#     average_faithfulness = total_faithfulness / num_questions
#     average_relevancy = total_relevancy / num_questions
    
#     return average_response_time, average_faithfulness, average_relevancy

In [12]:
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

Reference for settings: https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context_migration/

In [13]:
total_response_time = 0
total_faithfulness = 0
total_relevancy = 0

Settings.llm = llm_test
Settings.embed_model = embeddings
Settings.node_parser = SentenceSplitter(chunk_size=256, chunk_overlap=20)
Settings.num_output = 512
Settings.context_window = 3900

# a vector store index only needs an embed model
index = VectorStoreIndex.from_documents(eval_documents, embed_model=embeddings)

In [14]:
query_engine = index.as_query_engine(llm=llm_test)
num_questions = len(eval_questions)

for question in eval_questions:
    start_time = time.time()
    response_vector = query_engine.query(question)
    elapsed_time = time.time() - start_time
    faithfulness_result = faithfulness_llama2.evaluate_response(response=response_vector).passing
    relevancy_result = relevancy_llama2.evaluate_response(query=question, response=response_vector).passing
    
    total_response_time += elapsed_time
    total_faithfulness += faithfulness_result
    total_relevancy += relevancy_result
    
average_response_time = total_response_time / num_questions
average_faithfulness = total_faithfulness / num_questions
average_relevancy = total_relevancy / num_questions

print(f"Chunk size: 256, average_response_time: {average_response_time}, average_faithfulness: {average_faithfulness}, average_relevancy: {average_relevancy}")

Chunk size: 256, average_response_time: 27.546111536026, average_faithfulness: 1.0, average_relevancy: 0.85


In [16]:
total_response_time = 0
total_faithfulness = 0
total_relevancy = 0

Settings.llm = llm_test
Settings.embed_model = embeddings
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
Settings.num_output = 512
Settings.context_window = 3900

# a vector store index only needs an embed model
index = VectorStoreIndex.from_documents(eval_documents, embed_model=embeddings)

In [17]:
query_engine = index.as_query_engine(llm=llm_test)
num_questions = len(eval_questions)

for question in eval_questions:
    start_time = time.time()
    response_vector = query_engine.query(question)
    elapsed_time = time.time() - start_time
    faithfulness_result = faithfulness_llama2.evaluate_response(response=response_vector).passing
    relevancy_result = relevancy_llama2.evaluate_response(query=question, response=response_vector).passing
    
    total_response_time += elapsed_time
    total_faithfulness += faithfulness_result
    total_relevancy += relevancy_result
    
average_response_time = total_response_time / num_questions
average_faithfulness = total_faithfulness / num_questions
average_relevancy = total_relevancy / num_questions

print(f"Chunk size: 512, average_response_time: {average_response_time}, average_faithfulness: {average_faithfulness}, average_relevancy: {average_relevancy}")

Chunk size: 512, average_response_time: 25.311422169208527, average_faithfulness: 0.9, average_relevancy: 0.8


## Testing Across Different Chunk Sizes

Evaluate a range of chunk sizes to identify which offers the most promising metrics.

In [15]:
# # Iterate over different chunk sizes to evaluate the metrics to help fix the chunk size.
# for chunk_size in [256, 512]:
#     avg_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy(chunk_size)
#     print(f"Chunk size {chunk_size} - Average Response time: {avg_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")

- Only chunk sizes 256 and 512 were evaluated in this run, so the results say nothing about larger sizes such as 1024. 
- Average Response Time was slightly lower at 512 (about 25.3 s) than at 256 (about 27.5 s). 
- Average Faithfulness (1.0 vs 0.9) and Average Relevancy (0.85 vs 0.8) were both slightly higher at 256. 
- On this dataset, a chunk size of 256 gave marginally better response quality at a small cost in response time. With only 20 questions, these differences amount to one or two answers, so they are indicative rather than conclusive.
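To compare runs side by side, the printed metrics can be collected and ranked. The sketch below uses the two results recorded in the outputs above (chunk sizes 256 and 512). The ranking rule, mean of faithfulness and relevancy with response time as tie-breaker, is an assumption made for illustration; `ChunkResult` and `rank_results` are hypothetical names.

```python
from typing import NamedTuple


class ChunkResult(NamedTuple):
    chunk_size: int
    avg_response_time: float  # seconds
    avg_faithfulness: float   # pass rate in [0, 1]
    avg_relevancy: float      # pass rate in [0, 1]


def rank_results(results: list[ChunkResult]) -> list[ChunkResult]:
    """Sort best-first: higher mean quality wins, faster response breaks ties."""
    return sorted(
        results,
        key=lambda r: (-(r.avg_faithfulness + r.avg_relevancy) / 2, r.avg_response_time),
    )


if __name__ == "__main__":
    # Values copied from the printed outputs of the two runs in this notebook.
    results = [
        ChunkResult(256, 27.546111536026, 1.0, 0.85),
        ChunkResult(512, 25.311422169208527, 0.9, 0.8),
    ]
    for r in rank_results(results):
        print(f"chunk_size={r.chunk_size}: time={r.avg_response_time:.2f}s, "
              f"faithfulness={r.avg_faithfulness:.2f}, relevancy={r.avg_relevancy:.2f}")
```

With 20 questions, a difference of 0.05 in a pass rate corresponds to a single answer, so a ranking like this is only a starting point for further testing.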

## Conclusion

- Identifying the best chunk size for a RAG system is as much about intuition as it is empirical evidence. 
- With LlamaIndex’s Response Evaluation module, you can experiment with various sizes and base your decisions on concrete data. 
- When building a RAG system, always remember that chunk_size is a pivotal parameter. 
- Invest the time to evaluate and adjust the chunk size on your own documents and questions, since the best value depends on the dataset.

# Reference

https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5