# Chunking Strategies
This notebook compares several chunking mechanisms while holding every other parameter of the simple RAG pipeline constant, in order to find the mechanism that performs best.

Documents need to be chunked because the context window is limited and because large documents carry a lot of noise, which can distract the large language model (LLM) from the relevant context. The chunk size matters as well: text with similar meaning should stay together in the same chunk, so that the retriever has enough relevant chunks to pass to the LLM to answer the user's query.

The simple RAG pipeline uses recursive character chunking with a chunk size of 1000. This notebook experiments with smaller and larger chunk sizes for recursive character chunking, as well as with semantic chunking, to improve RAG performance.
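To illustrate what recursive character chunking does, here is a minimal sketch in plain Python. It is not LangChain's `RecursiveCharacterTextSplitter` and not the code behind `file_loader.character_text_splitter`; the function name `recursive_split` and the separator list are assumptions made for illustration, and chunk overlap is left out to keep the idea visible: try the coarsest separator first, and fall back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph about cats.\n\n"
    "Second paragraph about dogs that is quite a bit longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Paragraph boundaries are respected where possible, and only the long middle paragraph is broken at word boundaries. A smaller `chunk_size` forces more of these finer splits.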

The chunking mechanisms being tested are:
- Smaller chunk size for recursive character chunking
- Larger chunk size for recursive character chunking
- Semantic chunking
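Semantic chunking, the third mechanism, groups adjacent sentences by meaning instead of by length. The sketch below is a toy version under loud assumptions: a bag-of-words vector stands in for a real embedding model, the `threshold` value is arbitrary, and `semantic_chunks` is a hypothetical name that does not reflect how `file_loader.semantic_text_splitter` is implemented. It only shows the general idea of starting a new chunk where the similarity between neighbouring sentences drops.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.2):
    """Group adjacent sentences; start a new chunk when similarity falls below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    # Toy "embedding": word counts. A real pipeline would call an embedding model here.
    vectors = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]

    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


text = (
    "The cat sat on the mat. The cat chased the mouse. "
    "Stock markets fell sharply today. Markets recovered after the stock rally."
)
for chunk in semantic_chunks(text):
    print(repr(chunk))
```

With word counts, common words such as "the" inflate the similarity, which is one reason real semantic chunkers rely on learned embeddings.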



In [1]:
# Importing libraries
import sys

import pandas as pd
from dotenv import load_dotenv
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Load environment variables from the .env file
load_dotenv()

# Make the project directories importable
sys.path.insert(1, '/home/jabez/week_11/Contract-Advisor-RAG')
sys.path.insert(1, '/home/jabez/rizzbuzz with poetry/RAG-Optimization-System/scripts')

# Project-local helper modules
import file_loader
import pipelines
import evaluation



In [2]:
# Loading data
file_path = '/home/jabez/rizzbuzz with poetry/RAG-Optimization-System/data/cnn_dailymail_3.0.0.csv'
data = file_loader.load_csv(file_path)

### RecursiveCharacterTextSplitter with a chunk size of 500

In [None]:
# Run this cell only once; afterwards the vector store is persisted in the Chroma path
# RecursiveCharacterTextSplitter with a chunk size of 500
chunk_size = 500
chunk_overlap = 150
vectorstore_character = file_loader.character_text_splitter(data, chunk_size, chunk_overlap)
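To make the roles of `chunk_size` and `chunk_overlap` concrete, the following sketch uses a plain sliding window. This is a simplification and an assumption: the splitter used by `file_loader.character_text_splitter` also respects separators, and `sliding_window_chunks` is a hypothetical helper. With a size of 500 and an overlap of 150, each new chunk starts 350 characters after the previous one, so neighbouring chunks share 150 characters.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# A 1200-character dummy document, using the same settings as the cell above.
document = "".join(str(i % 10) for i in range(1200))
windows = sliding_window_chunks(document, chunk_size=500, chunk_overlap=150)

print(len(windows))                              # number of chunks
print([len(w) for w in windows])                 # length of each chunk
print(windows[0][-150:] == windows[1][:150])     # neighbouring chunks share the overlap
```

The overlap means a sentence that straddles a chunk boundary is still fully present in at least one chunk, at the cost of storing some text twice.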

In [4]:
# Create or load a Chroma database
embeddings = OpenAIEmbeddings()
db = Chroma(persist_directory="/home/jabez/rizzbuzz with poetry/RAG-Optimization-System/vector_store", embedding_function=embeddings)

In [6]:
# Loading synthetic test data
syntetic_test_data = pd.read_csv('/home/jabez/rizzbuzz with poetry/RAG-Optimization-System/test_data/syntetic_test_data.csv')

In [5]:
retriver = db.as_retriever(search_type="similarity", search_kwargs={"k": 6})
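The retriever above returns the `k` chunks most similar to the query (here `k` is 6). As a rough sketch of that step, the code below ranks toy two-dimensional vectors by cosine similarity. The vectors, the chunk ids, and the `top_k` helper are made up for illustration, and the choice of cosine similarity is an assumption: the metric actually used depends on how the Chroma collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Toy embeddings; a real store holds one high-dimensional vector per chunk.
chunk_vectors = {
    "a": [1.0, 0.0],
    "b": [0.7, 0.7],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
query = [1.0, 0.05]
print(top_k(query, chunk_vectors, k=2))
```

Because `k` is fixed, smaller chunks mean less total text reaches the LLM per query, which is one way chunk size can affect the retrieval metrics.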

In [6]:
# Adding answers to the test data using the simple pipeline
syntetic_test_data_with_answer = evaluation.adding_answer_to_testdata(syntetic_test_data, pipelines.simple_pipeline, db, retriver)



In [23]:
type(syntetic_test_data_with_answer)

datasets.arrow_dataset.Dataset

In [7]:
# Evaluating the test data from simple pipeline
simple_rag_evaluation_result = evaluation.ragas_evaluator(syntetic_test_data_with_answer)

Evaluating:  82%|████████▎ | 66/80 [00:27<00:03,  3.90it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 80/80 [00:33<00:00,  2.40it/s]


In [9]:
# Evaluation mean
result = evaluation.evaluation_mean(simple_rag_evaluation_result)

context_precision: 93.41%, faithfulness: 89.77%, answer_relevancy: 95.6%, context_recall: 88.33%


In [15]:
simple_rag_evaluation_result[['context_precision','faithfulness','answer_relevancy','context_recall']]

| | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|
| 0 | 1.0 | 0.8 | 0.881529 | 1.0 |
| 1 | 1.0 | 1.0 | 0.945981 | 1.0 |
| 2 | 1.0 | 0.909091 | 0.999533 | 1.0 |
| 3 | 1.0 | 0.75 | 1.0 | 0.5 |
| 4 | 0.755556 | 1.0 | 1.0 | 1.0 |
| 5 | 1.0 | 0.5 | 1.0 | 1.0 |
| 6 | 1.0 | 1.0 | 0.862756 | 1.0 |
| 7 | 0.926667 | 1.0 | 0.9946 | 0.833333 |
| 8 | 1.0 | 1.0 | 1.0 | 1.0 |
| 9 | 0.916667 | 1.0 | 1.0 | 1.0 |
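The per-sample scores can be averaged per metric to get summary figures of the kind printed by `evaluation.evaluation_mean`. The implementation of that helper is not shown in this notebook, so `metric_means` below is a hypothetical stand-in. It is applied only to the ten rows displayed above, so its output need not match the aggregate percentages reported for the full test set.

```python
METRICS = ("context_precision", "faithfulness", "answer_relevancy", "context_recall")

# The ten per-sample rows displayed above, in the same column order as METRICS.
rows = [
    (1.0, 0.8, 0.881529, 1.0),
    (1.0, 1.0, 0.945981, 1.0),
    (1.0, 0.909091, 0.999533, 1.0),
    (1.0, 0.75, 1.0, 0.5),
    (0.755556, 1.0, 1.0, 1.0),
    (1.0, 0.5, 1.0, 1.0),
    (1.0, 1.0, 0.862756, 1.0),
    (0.926667, 1.0, 0.9946, 0.833333),
    (1.0, 1.0, 1.0, 1.0),
    (0.916667, 1.0, 1.0, 1.0),
]


def metric_means(rows):
    """Average each metric column over all rows (hypothetical helper)."""
    count = len(rows)
    return {name: sum(row[i] for row in rows) / count for i, name in enumerate(METRICS)}


means = metric_means(rows)
print(", ".join(f"{name}: {value:.2%}" for name, value in means.items()))
```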


### RecursiveCharacterTextSplitter with a chunk size of 1000

In [3]:
# RecursiveCharacterTextSplitter with a chunk size of 1000
persist_directory = '/home/jabez/rizzbuzz with poetry/RAG-Optimization-System/large_vector_db'
chunk_size = 1000
chunk_overlap = 250
vectorstore_character = file_loader.character_text_splitter(data, chunk_size, chunk_overlap, persist_directory)

In [5]:
db_large = Chroma(persist_directory="/home/jabez/rizzbuzz with poetry/RAG-Optimization-System/large_vector_db", embedding_function=embeddings)

In [7]:
retriver_large = db_large.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [11]:
# Adding answers to the test data using the simple pipeline
syntetic_test_data_with_answer = evaluation.adding_answer_to_testdata(syntetic_test_data, pipelines.simple_pipeline, db_large, retriver_large)



In [13]:
# Evaluating the test data from simple pipeline
simple_rag_evaluation_result = evaluation.ragas_evaluator(syntetic_test_data_with_answer)

Evaluating: 100%|██████████| 80/80 [00:38<00:00,  2.09it/s]


In [14]:
# Evaluation mean
result = evaluation.evaluation_mean(simple_rag_evaluation_result)

context_precision: 95.58%, faithfulness: 86.82%, answer_relevancy: 85.93%, context_recall: 88.92%


### Semantic Chunking

In [15]:
# Setting semantic text splitter
vectorstore_semantic = file_loader.semantic_text_splitter(data)

In [3]:
# Create or load a Chroma database
embeddings = OpenAIEmbeddings()
db_chunking = Chroma(persist_directory="/home/jabez/rizzbuzz with poetry/RAG-Optimization-System/semantic_vector_db", embedding_function=embeddings)

In [4]:
# Setting the retriever for semantic chunking
retriver_semantic = db_chunking.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [None]:
# Adding answers to the test data using the simple pipeline
syntetic_test_data_with_answer = evaluation.adding_answer_to_testdata(syntetic_test_data, pipelines.simple_pipeline, db_chunking, retriver_semantic)

In [21]:
# Evaluating the test data from simple pipeline
simple_rag_evaluation_result = evaluation.ragas_evaluator(syntetic_test_data_with_answer)

Evaluating: 100%|██████████| 80/80 [00:55<00:00,  1.44it/s]


In [22]:
# Evaluation mean
result = evaluation.evaluation_mean(simple_rag_evaluation_result)

context_precision: 90.02%, faithfulness: 80.08%, answer_relevancy: 77.02%, context_recall: 82.92%


### Results
- Recursive character chunking with a chunk size of 500:
  context_precision: 94.83%, faithfulness: 92.05%, answer_relevancy: 86.31%, context_recall: 90.5%
- Recursive character chunking with a chunk size of 1000:
  context_precision: 94.17%, faithfulness: 91.71%, answer_relevancy: 81.33%, context_recall: 76.83%
- Semantic chunking:
  context_precision: 90.02%, faithfulness: 80.08%, answer_relevancy: 77.02%, context_recall: 82.92%
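
A small comparison helper can make the summary easier to read. The sketch below copies the percentages from the results list above into a dictionary and reports which strategy scores highest on each metric. The dictionary keys and the `best_per_metric` function are illustrative names, not part of the project's `evaluation` module.

```python
# Percentages copied from the results list above.
results = {
    "recursive_500": {
        "context_precision": 94.83, "faithfulness": 92.05,
        "answer_relevancy": 86.31, "context_recall": 90.5,
    },
    "recursive_1000": {
        "context_precision": 94.17, "faithfulness": 91.71,
        "answer_relevancy": 81.33, "context_recall": 76.83,
    },
    "semantic": {
        "context_precision": 90.02, "faithfulness": 80.08,
        "answer_relevancy": 77.02, "context_recall": 82.92,
    },
}


def best_per_metric(results):
    """Map each metric to the strategy with the highest score (hypothetical helper)."""
    metrics = next(iter(results.values())).keys()
    return {m: max(results, key=lambda strategy: results[strategy][m]) for m in metrics}


best = best_per_metric(results)
for metric, strategy in best.items():
    print(f"{metric}: {strategy} ({results[strategy][metric]}%)")
```

On these figures, recursive character chunking with a chunk size of 500 scores highest on all four metrics.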
