# **The Art of Chunking**

> Reference: https://towardsdatascience.com/the-art-of-chunking-boosting-ai-performance-in-rag-architectures-acdbdb8bdc2b

## **Techniques to Improve Chunking**

### **1. Fixed Character Sizes**

#### Pros:
- **Simplicity**: Easy to implement and requires minimal computational resources.
- **Consistency**: Produces uniform chunks, simplifying downstream processing.
#### Cons:
- **Context Ignorance**: Ignores the structure and meaning of the text, resulting in fragmented information.
- **Inefficiency**: May cut off important context, requiring additional processing to reassemble meaningful information.

In [1]:
# Sample text to chunk
text = "This is the text I would like to chunk up. It is the example text for this exercise."

# Set the chunk size
chunk_size = 35
# Initialize a list to hold the chunks
chunks = []
# Iterate over the text to create chunks
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
# Display the chunks
print(chunks)
# Output: ['This is the text I would like to ch', 'unk up. It is the example text for ', 'this exercise.']

['This is the text I would like to ch', 'unk up. It is the example text for ', 'this exercise.']


- Using LangChain’s CharacterTextSplitter to achieve the same result:

In [2]:
from langchain_text_splitters import CharacterTextSplitter  # pip install -qU langchain-text-splitters

# Initialize the text splitter with specified chunk size
text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=0, separator='', strip_whitespace=False)
# Create documents using the text splitter
documents = text_splitter.create_documents([text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
# Output: 
# This is the text I would like to ch
# unk up. It is the example text for 
# this exercise.

This is the text I would like to ch
unk up. It is the example text for 
this exercise.
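
The main drawback listed above is that fixed-size chunks cut context at arbitrary positions. One common mitigation is to let consecutive chunks overlap. The following is a minimal plain-Python sketch of that idea; the function name `fixed_size_chunks` and its `overlap` parameter are illustrative assumptions, not part of LangChain.

```python
def fixed_size_chunks(text, chunk_size, overlap=0):
    """Cut text into chunks of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the previous chunk has already reached the end of the text,
    # so the last chunk is never fully contained in the one before it.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample_text = "This is the text I would like to chunk up. It is the example text for this exercise."

# With overlap=0 this reproduces the plain loop shown earlier.
print(fixed_size_chunks(sample_text, 35))

# With overlap=10 each chunk starts with the last 10 characters of the previous one.
for chunk in fixed_size_chunks(sample_text, 35, overlap=10):
    print(repr(chunk))
```

Overlap trades storage and embedding cost for a better chance that a sentence cut at one boundary appears whole in a neighboring chunk.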


### **2. Recursive Character Chunking**

#### Pros:
- **Improved Context**: This method preserves the text’s natural structure using separators like paragraphs or sentences.
- **Flexibility**: Allows for varying chunk sizes and overlaps, providing better control over the chunking process.
#### Cons:
- **The chunk size matters**: Chunks should stay small enough to be manageable yet still hold at least one complete phrase. Otherwise, retrieval loses precision.
- **Performance Overhead**: Requires more computational resources due to recursive splitting and handling of multiple separators, and typically generates more chunks than fixed-size chunking.

In [3]:
# !pip install -qU langchain-text-splitters

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Sample text to chunk
text = """
The Olympic Games, originally held in ancient Greece, were revived in 1896 and
have since become the world’s foremost sports competition, bringing together 
athletes from around the globe.
"""
# Initialize the recursive character text splitter with specified chunk size
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=30,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

# Create documents using the text splitter
documents = text_splitter.create_documents([text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
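# Output (chunk_overlap=20 makes consecutive chunks share words):
# The Olympic Games, originally
# Games, originally held in
# originally held in ancient
# held in ancient Greece, were
# Greece, were revived in 1896
# revived in 1896 and
# have since become the world’s
# become the world’s foremost
# world’s foremost sports
# foremost sports competition,
# sports competition, bringing
# bringing together
# athletes from around the
# from around the globe.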

The Olympic Games, originally
Games, originally held in
originally held in ancient
held in ancient Greece, were
Greece, were revived in 1896
revived in 1896 and
have since become the world’s
become the world’s foremost
world’s foremost sports
foremost sports competition,
sports competition, bringing
bringing together
athletes from around the
from around the globe.
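
To make the recursive idea concrete, here is a simplified plain-Python sketch: try the coarsest separator first, pack the resulting pieces into chunks, and fall back to finer separators only for pieces that are still too long. It is an illustration under stated assumptions (the name `recursive_split`, the default separator order, and no overlap handling), not LangChain's actual implementation.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters (no overlap)."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cuts at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits into the open chunk
            continue
        if current:
            chunks.append(current)       # close the open chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece              # piece starts a new chunk
        else:
            # Piece is too long on its own: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sentence = "The Olympic Games, originally held in ancient Greece, were revived in 1896."
for chunk in recursive_split(sentence, chunk_size=30):
    print(chunk)
```

Because the sample has no paragraph or line breaks, the sketch falls through to the space separator and packs whole words into each chunk instead of cutting mid-word.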


### **3. Document-Specific Splitting**

#### Pros:
- **Relevance**: Different document types are split using the most appropriate method, preserving their logical structure.
- **Precision**: Tailors the chunking process to the unique characteristics of each document type.
#### Cons:
- **Complex Implementation**: Requires different chunking strategies and libraries for various document types.
- **Maintenance**: Keeping several splitting methods up to date takes more effort than maintaining a single one.

#### **3.1 Markdown Splitting**

In [5]:
from langchain_text_splitters import MarkdownTextSplitter
# Sample Markdown text
markdown_text = """
# Fun in California
## Driving
Try driving on the 1 down to San Diego
### Food
Make sure to eat a burrito while you're there
## Hiking
Go to Yosemite
"""
# Initialize the Markdown text splitter
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)
# Create documents using the text splitter
documents = splitter.create_documents([markdown_text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
# Output:
# # Fun in California\n\n## Driving
# Try driving on the 1 down to San Diego
# ### Food
# Make sure to eat a burrito while you're
# there
# ## Hiking\n\nGo to Yosemite

# Fun in California
## Driving
Try driving on the 1 down to San Diego
### Food
Make sure to eat a burrito while you're
there
## Hiking
Go to Yosemite
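
Structure-aware splitting can also be sketched without any library. The example below is a minimal illustration that starts a new section at every Markdown heading line. The helper name `split_markdown_by_heading` is hypothetical, and the sketch assumes ATX-style headings (`#` to `######`) and does not account for `#` lines inside fenced code blocks.

```python
import re


def split_markdown_by_heading(markdown_text):
    """Group lines into sections, starting a new section at each heading."""
    sections, current = [], []
    for line in markdown_text.strip().splitlines():
        # A heading closes the section collected so far.
        if re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


markdown_sample = """
# Fun in California
## Driving
Try driving on the 1 down to San Diego
### Food
Make sure to eat a burrito while you're there
## Hiking
Go to Yosemite
"""

for section in split_markdown_by_heading(markdown_sample):
    print(repr(section))
```

Each section keeps its heading together with the text beneath it, which is the logical structure a size-only splitter tends to break.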


#### **3.2 Python Code Splitting**

In [6]:
from langchain_text_splitters import PythonCodeTextSplitter
# Sample Python code
python_text = """
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p1 = Person("John", 36)
for i in range(10):
    print(i)
"""
# Initialize the Python code text splitter
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
# Create documents using the text splitter
documents = python_splitter.create_documents([python_text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
# Output:
# class Person:\n    def __init__(self, name, age):\n        self.name = name\n        self.age = age
# p1 = Person("John", 36)\n\nfor i in range(10):\n    print(i)

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p1 = Person("John", 36)
for i in range(10):
    print(i)
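
For Python sources specifically, the standard-library `ast` module offers another way to respect code structure: parse the file and emit one chunk per top-level statement. This is a hedged sketch of an alternative technique, not what `PythonCodeTextSplitter` does internally; the helper name `split_python_top_level` is an assumption, and the source must be syntactically valid for `ast.parse` to succeed.

```python
import ast


def split_python_top_level(source):
    """Return the source text of each top-level statement as its own chunk."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]


python_sample = '''
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p1 = Person("John", 36)
for i in range(10):
    print(i)
'''

segments = split_python_top_level(python_sample)
for segment in segments:
    print(segment)
    print("---")
```

Unlike a size-based splitter, this never cuts a class or loop in half, but a single large class still becomes one large chunk, so a size limit may be needed on top.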


### **4. Semantic Splitting**

#### Pros:
- **Contextual Relevance**: Ensures that chunks contain semantically similar content, enhancing the accuracy of information retrieval and generation.
- **Dynamic Adaptability**: Can adapt to various text structures and content types based on meaning rather than rigid rules.
#### Cons:
- **Computational Overhead**: Requires additional computational resources to generate and compare embeddings.
- **Complexity**: More complex to implement compared to simpler splitting methods.

> Reference: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
from langchain_openai import OpenAIEmbeddings  # pip install -qU langchain-openai; needs OPENAI_API_KEY
import re
# Sample text
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.
Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.
It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer.
"""
# Split the text into sentences (strip first so no empty or newline-prefixed sentence is produced)
sentences = re.split(r'(?<=[.?!])\s+', text.strip())
sentences = [{'sentence': x, 'index' : i} for i, x in enumerate(sentences)]
# Combine sentences for context
def combine_sentences(sentences, buffer_size=1):
    for i in range(len(sentences)):
        combined_sentence = ''
        for j in range(i - buffer_size, i):
            if j >= 0:
                combined_sentence += sentences[j]['sentence'] + ' '
        combined_sentence += sentences[i]['sentence']
        for j in range(i + 1, i + 1 + buffer_size):
            if j < len(sentences):
                combined_sentence += ' ' + sentences[j]['sentence']
        sentences[i]['combined_sentence'] = combined_sentence
    return sentences
sentences = combine_sentences(sentences)
# Generate embeddings
oai_embeds = OpenAIEmbeddings()
embeddings = oai_embeds.embed_documents([x['combined_sentence'] for x in sentences])
# Add embeddings to sentences
for i, sentence in enumerate(sentences):
    sentence['combined_sentence_embedding'] = embeddings[i]
# Calculate cosine distances
def calculate_cosine_distances(sentences):
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]['combined_sentence_embedding']
        embedding_next = sentences[i + 1]['combined_sentence_embedding']
        similarity = cosine_similarity([embedding_current], [embedding_next])[0][0]
        distance = 1 - similarity
        distances.append(distance)
        sentences[i]['distance_to_next'] = distance
    return distances, sentences
distances, sentences = calculate_cosine_distances(sentences)
# Determine breakpoints and create chunks
import numpy as np
breakpoint_distance_threshold = np.percentile(distances, 95)
indices_above_thresh = [i for i, x in enumerate(distances) if x > breakpoint_distance_threshold]
# Combine sentences into chunks
chunks = []
start_index = 0
for index in indices_above_thresh:
    end_index = index
    group = sentences[start_index:end_index + 1]
    combined_text = ' '.join([d['sentence'] for d in group])
    chunks.append(combined_text)
    start_index = index + 1
if start_index < len(sentences):
    combined_text = ' '.join([d['sentence'] for d in sentences[start_index:]])
    chunks.append(combined_text)
# Display the created chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk #{i+1}:\n{chunk}\n")


Chunk #1:

One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear. Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers.

Chunk #2:
You get no customers, and you go out of business. It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. 
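
The breakpoint step can be tried without an API key. The sketch below applies the same logic (cosine distance between neighbors, then a percentile threshold) to hand-made 2-D vectors. The vectors are toy stand-ins for real embeddings, and the helper names `cosine_distance`, `percentile`, and `find_breakpoints` are illustrative assumptions. The percentile uses linear interpolation, which is also NumPy's default.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms


def percentile(values, p):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def find_breakpoints(embeddings, p=95):
    """Indices i where the distance between item i and item i + 1 exceeds the threshold."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, p)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy "embeddings": the first three point one way, the last two another way.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
print(find_breakpoints(toy_embeddings))  # [2]: split after the third item
```

With a 95th-percentile threshold and only a handful of sentences, just the single largest jump becomes a breakpoint, which is why the example text above ends up in two chunks. Lowering the percentile yields more, smaller chunks.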



### **5. Agentic Splitting**

#### Pros:
- **High Precision**: Provides highly relevant and contextually accurate chunks by using sophisticated language models.
- **Adaptability**: Can handle diverse types of text and adjust chunking strategies on the fly.
#### Cons:
- **Resource-Intensive, with Additional LLM Cost**: Chunking decisions are made by a large language model, which requires significant computational resources and adds cost.
- **Complex Implementation**: Involves setting up and fine-tuning language models for optimal performance.

In [8]:
import re

# A minimal plain-Python sketch of the agentic pipeline:
# input -> sentence splitting -> LLM decision -> chunking.
# Assumption: `decide` stands in for your LLM client. It receives the prompt
# and must return True (same chunk) or False (start a new chunk).
PROMPT_TEMPLATE = "Does the sentence '{next_sentence}' belong to the same chunk as '{current_chunk}'?"

# Step 1: Sentence splitting
def split_sentences(document):
    return [s for s in re.split(r'(?<=[.?!])\s+', document.strip()) if s]

# Step 2: LLM decision (placeholder that always answers "yes")
def always_same_chunk(prompt):
    # Replace with a real LLM call that parses a yes/no answer from the model.
    return True

# Step 3: Chunking
def agentic_chunk(document, decide):
    chunks = []
    for sentence in split_sentences(document):
        if chunks and decide(PROMPT_TEMPLATE.format(next_sentence=sentence, current_chunk=chunks[-1])):
            chunks[-1] += ' ' + sentence
        else:
            chunks.append(sentence)
    return chunks

# Run the pipeline
document = "Your document text here..."
result = agentic_chunk(document, decide=always_same_chunk)
print(result)

['Your document text here...']

