# Chunking Strategies
This notebook experiments with different chunking mechanisms while keeping every other parameter of the simple RAG pipeline constant, in order to find the mechanism that performs best.

Documents need to be chunked because the context window is limited, and larger documents carry more noise, which can distract the large language model (LLM) from the relevant context. The chunk size matters too: text with similar meaning should stay together in one chunk, so that the retriever can give the LLM enough relevant context to answer the user's query.

The simple RAG pipeline used recursive character chunking with a chunk size of 1000. This notebook experiments with smaller and larger chunk sizes for recursive character chunking, as well as with semantic chunking, to improve RAG performance.
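To make the idea of recursive character chunking concrete, here is a minimal pure-Python sketch. It is not the LangChain `RecursiveCharacterTextSplitter` implementation: the function name `recursive_split` and the sample text are hypothetical, and chunk overlap is left out for brevity. It tries the coarsest separator first and falls back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (illustrative only)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\n\ndddd eeee"  # hypothetical sample text
small = recursive_split(sample, chunk_size=5)
large = recursive_split(sample, chunk_size=10)
print(len(small), "chunks vs", len(large), "chunks")
```

A smaller chunk size yields more, shorter chunks from the same text, which is the trade-off the experiments in this notebook explore.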

The chunking mechanisms being tested are:
- Smaller chunk size for recursive character chunking
- Larger chunk size for recursive character chunking
- Semantic chunking
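As a rough illustration of how semantic chunking differs from size-based splitting, the sketch below starts a new chunk whenever two adjacent sentences are not similar enough. Everything in it is an assumption made for illustration: the bag-of-words `embed` function stands in for a real embedding model, the `threshold` value is arbitrary, and the logic inside `file_loader.semantic_text_splitter` is not shown in this notebook and may work differently.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.2):
    """Group consecutive sentences; break where adjacent similarity drops below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])    # topic shift: start a new chunk
        else:
            chunks[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]


text = (
    "The cat sat on the mat. The cat chased the mouse. "
    "Interest rates rose sharply. Banks raised interest rates again."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

Unlike the recursive splitter, the chunk boundaries here depend on content rather than on a character budget, so chunk lengths can vary widely.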



In [18]:
# Importing libraries
import sys
import json
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
import pandas as pd
load_dotenv()
sys.path.insert(1, '/home/jabez/rizzbuzz with poetry/RAG-Optimization-System/scripts')
import file_loader 
import pipelines 
import evaluation

In [14]:
# Load JSON from file
json_path = '../filepath.json'

with open(json_path, 'r') as json_file:
    file_paths = json.load(json_file)
data_file_path = file_paths['data_file_path']
synthetic_test_data_path = file_paths['synthetic_test_data_path']

# loading data
data = file_loader.load_csv(data_file_path)

# loading synthetic test data
synthetic_test_data = pd.read_csv(synthetic_test_data_path)

# loading persist directory for smaller chunk vector db
persist_directory_for_smaller_chunck_vector_db = file_paths['persist_directory_for_smaller_chunck_vector_db']

# loading persist directory for larger chunk vector db
persist_directory_for_larger_chunk_vector_db = file_paths['persist directory for larger_chunk_vector_db']

# loading persist directory for semantic vector db
persist_directory_for_semantic_vector_db = file_paths['persist_directory_for_semantic_vector_db']

### RecursiveCharacterTextSplitter with a chunk size of 500

In [4]:
# Load a Chroma database
embeddings = OpenAIEmbeddings()
smaller_chunck_vector_db = Chroma(persist_directory=persist_directory_for_smaller_chunck_vector_db, embedding_function=embeddings)

# Setting the retriever
retriver = smaller_chunck_vector_db.as_retriever(search_type="similarity", search_kwargs={"k": 6})

# Adding answer to test data from simple pipeline
syntetic_test_data_with_answer = evaluation.adding_answer_to_testdata(synthetic_test_data, pipelines.simple_pipeline, smaller_chunck_vector_db, retriver)

In [7]:
# Evaluating the test data from simple pipeline
simple_rag_evaluation_result = evaluation.ragas_evaluator(syntetic_test_data_with_answer)

Evaluating:  82%|████████▎ | 66/80 [00:27<00:03,  3.90it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 80/80 [00:33<00:00,  2.40it/s]


In [9]:
# Evaluation mean
result = evaluation.evaluation_mean(simple_rag_evaluation_result)

context_precision: 93.41%, faithfulness: 89.77%, answer_relevancy: 95.6%, context_recall: 88.33%
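The warning in the evaluation output above ("No statements were generated from the answer") suggests that some rows may end up without a usable score for a metric. The sketch below shows one way to average per-row scores while skipping missing or NaN values. It is a hypothetical stand-in, not the actual `evaluation.evaluation_mean` (which is not shown in this notebook), and the sample rows are made up for illustration.

```python
import math


def metric_means(rows, metrics):
    """Average each metric over rows, skipping missing or NaN scores."""
    means = {}
    for metric in metrics:
        values = [
            row[metric]
            for row in rows
            if row.get(metric) is not None and not math.isnan(row[metric])
        ]
        means[metric] = sum(values) / len(values) if values else float("nan")
    return means


def format_means(means):
    """Render the means as percentages, similar to the printed output above."""
    return ", ".join(f"{name}: {round(value * 100, 2)}%" for name, value in means.items())


# Made-up per-row scores; the second row has no faithfulness score.
rows = [
    {"faithfulness": 1.0, "answer_relevancy": 0.9},
    {"faithfulness": float("nan"), "answer_relevancy": 0.7},
    {"faithfulness": 0.5, "answer_relevancy": 0.8},
]
means = metric_means(rows, ["faithfulness", "answer_relevancy"])
print(format_means(means))
```

Skipping unscored rows keeps a single NaN from turning the whole mean into NaN, at the cost of averaging each metric over a slightly different number of rows.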


### RecursiveCharacterTextSplitter with a chunk size of 1000

In [15]:
db_large = Chroma(persist_directory=persist_directory_for_larger_chunk_vector_db, embedding_function=embeddings)

In [7]:
retriver_large = db_large.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [11]:
# Adding answer to test data from simple pipeline
syntetic_test_data_with_answer = evaluation.adding_answer_to_testdata(synthetic_test_data, pipelines.simple_pipeline, db_large, retriver_large)


In [13]:
# Evaluating the test data from simple pipeline
simple_rag_evaluation_result = evaluation.ragas_evaluator(syntetic_test_data_with_answer)

Evaluating: 100%|██████████| 80/80 [00:38<00:00,  2.09it/s]


In [14]:
# Evaluation mean
result = evaluation.evaluation_mean(simple_rag_evaluation_result)

context_precision: 95.58%, faithfulness: 86.82%, answer_relevancy: 85.93%, context_recall: 88.92%


### Semantic Chunking

In [None]:
vectorstore_semantic = file_loader.character_text_splitter_large_embedding(data, persist_directory_for_semantic_vector_db)

In [15]:
# Setting semantic text splitter
vectorstore_semantic = file_loader.semantic_text_splitter(data)

In [16]:
# Create or load a Chroma database
embeddings = OpenAIEmbeddings()
db_semantic = Chroma(persist_directory=persist_directory_for_semantic_vector_db, embedding_function=embeddings)

In [17]:
# Setting retriever for semantic chunking
retriver_semantic = db_semantic.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [None]:
# Adding answer to test data from simple pipeline
syntetic_test_data_with_answer = evaluation.adding_answer_to_testdata(synthetic_test_data, pipelines.simple_pipeline, db_semantic, retriver_semantic)

In [21]:
# Evaluating the test data from simple pipeline
simple_rag_evaluation_result = evaluation.ragas_evaluator(syntetic_test_data_with_answer)

Evaluating: 100%|██████████| 80/80 [00:55<00:00,  1.44it/s]


In [22]:
# Evaluation mean
result = evaluation.evaluation_mean(simple_rag_evaluation_result)

context_precision: 90.02%, faithfulness: 80.08%, answer_relevancy: 77.02%, context_recall: 82.92%


### Results:
- Recursive character chunking with a chunk size of 500:
  context_precision: 94.83%, faithfulness: 92.05%, answer_relevancy: 86.31%, context_recall: 90.5%
- Recursive character chunking with a chunk size of 1000:
  context_precision: 94.17%, faithfulness: 91.71%, answer_relevancy: 81.33%, context_recall: 76.83%
- Semantic chunking:
  context_precision: 90.02%, faithfulness: 80.08%, answer_relevancy: 77.02%, context_recall: 82.92%
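A small sketch for comparing the three configurations metric by metric, using the figures listed under Results. The dictionary keys (`recursive_500`, `recursive_1000`, `semantic`) are labels introduced here for illustration only.

```python
# Figures copied from the Results list (percentages).
results = {
    "recursive_500": {
        "context_precision": 94.83, "faithfulness": 92.05,
        "answer_relevancy": 86.31, "context_recall": 90.5,
    },
    "recursive_1000": {
        "context_precision": 94.17, "faithfulness": 91.71,
        "answer_relevancy": 81.33, "context_recall": 76.83,
    },
    "semantic": {
        "context_precision": 90.02, "faithfulness": 80.08,
        "answer_relevancy": 77.02, "context_recall": 82.92,
    },
}

metrics = list(next(iter(results.values())))

# For each metric, pick the configuration with the highest score.
best_per_metric = {
    metric: max(results, key=lambda name: results[name][metric])
    for metric in metrics
}

for metric, name in best_per_metric.items():
    print(f"{metric}: {name} ({results[name][metric]}%)")
```

On these figures, the 500-character recursive configuration scores highest on every metric.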
