# Chunk optimization

To fit within the context window of an LLM, documents are usually split into smaller parts when building RAG pipelines. This is called chunking. Chunking also reduces cost and noise in the *generation* step, but it introduces a new problem: "How do we avoid losing important information when splitting a document into chunks?"

In baseline RAG, the document is usually split into fixed-size chunks with a fixed overlap between adjacent chunks. This works well in most common cases, is computationally cheap, and requires no NLP models.
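
To make the fixed-size-with-overlap idea concrete, here is a minimal sketch in plain Python. It assumes whitespace-separated words as a stand-in for tokens, and the function name `fixed_size_chunks` is made up for illustration. Library splitters usually count tokenizer tokens and may also respect sentence boundaries, so this is not how any particular library implements it.

```python
def fixed_size_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Words stand in for tokens here (an assumption made for readability).
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in fixed_size_chunks(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(chunk))
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last `chunk_overlap` tokens of one chunk are repeated at the start of the next.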

This notebook explores chunk optimization through a few different strategies:

1. **Fixed size chunking**: Split the document into chunks of fixed size.
2. **Semantic chunking**: Split the document where the semantic meaning of the text shifts, so that each chunk is semantically coherent.
3. **Hyperparameter tuning**: Search over chunking parameters with a traditional ML-style grid search.
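
As a rough illustration of the grid-search idea in item 3, the sketch below tries every combination of two chunking parameters and keeps the best-scoring one. Everything here is hypothetical: `grid_search_chunking` and `toy_score` are made-up names, and `toy_score` is an arbitrary placeholder. In a real pipeline the score would come from an evaluation of retrieval or answer quality on a set of test questions.

```python
from itertools import product


def grid_search_chunking(chunk_sizes, chunk_overlaps, score_fn):
    """Score every valid (chunk_size, chunk_overlap) pair and return the best one."""
    results = []
    for chunk_size, chunk_overlap in product(chunk_sizes, chunk_overlaps):
        if chunk_overlap >= chunk_size:
            continue  # an overlap as large as the chunk would never advance
        results.append(
            {
                "chunk_size": chunk_size,
                "chunk_overlap": chunk_overlap,
                "score": score_fn(chunk_size, chunk_overlap),
            }
        )
    if not results:
        raise ValueError("no valid (chunk_size, chunk_overlap) combination")
    best = max(results, key=lambda r: r["score"])
    return best, results


def toy_score(chunk_size, chunk_overlap):
    # Placeholder only: an arbitrary function so the sketch runs end to end.
    # Replace with a real evaluation metric (higher is better).
    return -(abs(chunk_size - 256) + abs(chunk_overlap - 20))


best, results = grid_search_chunking([128, 256, 512], [0, 20, 40], toy_score)
print(best)
```

Because a grid search evaluates every combination, its cost grows multiplicatively with each added parameter, so keep the grid small when every evaluation involves embedding and LLM calls.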

Other strategies include:

1. **Document-specific chunking**: Split the document based on its logical sections. Useful for Markdown, HTML, etc.
2. **Recursive chunking**: Divide the input text into smaller chunks hierarchically and iteratively using a set of separators. If the first split doesn't produce chunks of the desired size or structure, the method calls itself on the resulting chunks with a different separator or criterion until the desired size or structure is reached.
3. **Agentic chunking**: Use LLMs as "agents" and split the document the way a human would: start at the top and work down the document, deciding at each sentence whether to start a new chunk.
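
The recursive strategy can be sketched in a few lines of plain Python. This is a simplified illustration, not any library's implementation: `recursive_split`, the character-based `max_len`, and the default separator order are all assumptions, and separators that fall on a chunk boundary are simply dropped.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_len characters, coarsest separator first."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces greedily
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # Piece is still too long: recurse with the finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two has a first sentence. It also has a second sentence that is longer.\n\n"
    "End."
)
for chunk in recursive_split(sample, max_len=40):
    print(repr(chunk))
```

Paragraph breaks are tried first, so short paragraphs stay intact. Only the oversized paragraph is broken down further, first by sentence and then by word.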


## Set up libraries and environment

In [None]:
import os
from dotenv import load_dotenv
from util.helpers import get_wiki_pages, create_and_save_wiki_md_files, pretty_print_node

from IPython.display import display, Markdown

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Add the following to a `.env` file in the root of the project if not already there.

```
OPENAI_API_KEY=<YOUR_KEY_HERE>
```

In [None]:
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

In [None]:
pages = get_wiki_pages(["Vincent Van Gogh"])
create_and_save_wiki_md_files(pages=pages, path="./data/docs/wiki/")
documents = SimpleDirectoryReader("./data/docs/wiki/").load_data()

In [None]:
embedding = OpenAIEmbedding(api_key=OPENAI_API_KEY, model="text-embedding-3-small")
llm = OpenAI(api_key=OPENAI_API_KEY, model="gpt-4-turbo")

## Fixed size chunking

In [None]:
fixed_size_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=40)
fixed_nodes = fixed_size_splitter.get_nodes_from_documents(documents)


In [None]:

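# Build the separator outside the f-string: backslashes inside f-string
# expressions are a SyntaxError before Python 3.12.
separator = "\n\n------------\n\n"
display(Markdown(separator.join(node.get_content() for node in fixed_nodes[2:6])))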

In [None]:
fixed_index = VectorStoreIndex(nodes=fixed_nodes)
fixed_query_engine = fixed_index.as_query_engine(llm=llm)

## Semantic chunking

In [None]:
semantic_splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embedding)
semantic_nodes = semantic_splitter.get_nodes_from_documents(documents)
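
To give some intuition for the percentile threshold used above, the sketch below groups sentences by starting a new chunk wherever the cosine distance between neighbouring sentence embeddings exceeds a given percentile of all such distances. It is a simplified illustration with hand-made 2-D vectors in place of real embeddings; `semantic_groups` and `percentile` are hypothetical helpers, and the actual splitter differs in its details (for example, `buffer_size` groups neighbouring sentences before they are compared).

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def percentile(values, pct):
    """Linearly interpolated percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def semantic_groups(sentences, embeddings, threshold_pct):
    """Start a new group wherever the neighbour distance exceeds the percentile cutoff."""
    if len(sentences) < 2:
        return [list(sentences)] if sentences else []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    groups = [[sentences[0]]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:
            groups.append([sentence])  # large semantic jump: open a new chunk
        else:
            groups[-1].append(sentence)
    return groups


# Toy data: A and B point one way, C, D and E another (made-up vectors).
toy_sentences = ["A", "B", "C", "D", "E"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
print(semantic_groups(toy_sentences, toy_embeddings, threshold_pct=95))
```

Lowering the percentile lets more neighbour distances exceed the cutoff, which produces more and smaller chunks.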

In [None]:
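# Build the separator outside the f-string: backslashes inside f-string
# expressions are a SyntaxError before Python 3.12.
separator = "\n\n------------\n\n"
display(Markdown(separator.join(node.get_content() for node in semantic_nodes[3:7])))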

In [None]:
semantic_index = VectorStoreIndex(nodes=semantic_nodes)
semantic_query_engine = semantic_index.as_query_engine(llm=llm)

## Compare the different chunking strategies

In [None]:
query = "Tell me about Vincent Van Gogh's early life"

In [None]:
fixed_retriever = fixed_index.as_retriever()
fixed_retrieved_nodes = fixed_retriever.retrieve(query)
pretty_print_node(fixed_retrieved_nodes[0])

In [None]:
semantic_retriever = semantic_index.as_retriever()
semantic_retrieved_nodes = semantic_retriever.retrieve(query)
pretty_print_node(semantic_retrieved_nodes[0])
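
Conceptually, the retrieval step above ranks chunks by how similar their embeddings are to the query embedding and returns the closest ones. The sketch below shows that idea with hand-made 2-D vectors. `top_k` is a hypothetical helper, and a real vector store may use a different similarity measure or an approximate search.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def top_k(query_vec, chunk_vecs, k=2):
    """Return (chunk_index, similarity) for the k chunks closest to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), index)
        for index, vec in enumerate(chunk_vecs)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(index, score) for score, index in scored[:k]]


# Made-up embeddings: chunk 0 is almost parallel to the query, chunk 1 is orthogonal.
query_vec = [1.0, 0.0]
chunk_vecs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
for index, score in top_k(query_vec, chunk_vecs, k=2):
    print(index, round(score, 3))
```

The chunking strategy determines which text ends up behind each vector, so the same query can surface quite different passages from the fixed-size and semantic indexes.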

In [None]:
fixed_response = fixed_query_engine.query(query)
print(fixed_response)

In [None]:
semantic_response = semantic_query_engine.query(query)
print(semantic_response)