# **The Art of Chunking**

> Reference: https://towardsdatascience.com/the-art-of-chunking-boosting-ai-performance-in-rag-architectures-acdbdb8bdc2b

## **Techniques to Improve Chunking**

### **1. Fixed Character Sizes**

#### Pros:
- **Simplicity**: Easy to implement and requires minimal computational resources.
- **Consistency**: Produces uniform chunks, simplifying downstream processing.
#### Cons:
- **Context Ignorance**: Ignores the structure and meaning of the text, resulting in fragmented information.
- **Inefficiency**: May cut off important context, requiring additional processing to reassemble meaningful information.

In [1]:
# Sample text to chunk
text = "This is the text I would like to chunk up. It is the example text for this exercise."

# Set the chunk size
chunk_size = 35
# Initialize a list to hold the chunks
chunks = []
# Iterate over the text to create chunks
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
# Display the chunks
print(chunks)
# Output: ['This is the text I would like to ch', 'unk up. It is the example text for ', 'this exercise.']

['This is the text I would like to ch', 'unk up. It is the example text for ', 'this exercise.']
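
The hard cuts above ("ch" / "unk") are the context loss listed under the cons. One common way to soften them is to let consecutive chunks overlap. The following is a minimal sketch, assuming a hypothetical helper named `chunk_with_overlap` and an arbitrary overlap of 10 characters; neither comes from the referenced article.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Fixed-size chunks; each chunk repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the tail of the text is already covered by this chunk
        start += step
    return chunks


text = "This is the text I would like to chunk up. It is the example text for this exercise."
overlapping_chunks = chunk_with_overlap(text, chunk_size=35, overlap=10)
for chunk in overlapping_chunks:
    print(repr(chunk))
```

The overlap does not make the splitter aware of sentence boundaries; it only raises the chance that a phrase cut at one boundary appears whole in the neighbouring chunk.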


Using LangChain’s `CharacterTextSplitter` to achieve the same result:

In [2]:
from langchain.text_splitter import CharacterTextSplitter

# Initialize the text splitter with specified chunk size
text_splitter = CharacterTextSplitter(chunk_size=35, chunk_overlap=0, separator='', strip_whitespace=False)
# Create documents using the text splitter
documents = text_splitter.create_documents([text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
# Output: 
# This is the text I would like to ch
# unk up. It is the example text for 
# this exercise.

This is the text I would like to ch
unk up. It is the example text for 
this exercise.


### **2. Recursive Character Chunking**

#### Pros:
- **Improved Context**: This method preserves the text’s natural structure using separators like paragraphs or sentences.
- **Flexibility**: Allows for varying chunk sizes and overlaps, providing better control over the chunking process.
#### Cons:
- **The chunk size matters**: Chunks should be small enough to be manageable but still contain at least one complete phrase. Otherwise, retrieval loses precision.
- **Performance Overhead**: Requires more computational resources due to recursive splitting and the handling of multiple separators. It also tends to generate more chunks than fixed-size chunking.

In [3]:
# !pip install -qU langchain-text-splitters

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Sample text to chunk
text = """
The Olympic Games, originally held in ancient Greece, were revived in 1896 and
have since become the world’s foremost sports competition, bringing together 
athletes from around the globe.
"""
# Initialize the recursive character text splitter with specified chunk size
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=30,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

# Create documents using the text splitter
documents = text_splitter.create_documents([text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
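# Output (with chunk_overlap=20, neighbouring chunks share words):
# The Olympic Games, originally
# Games, originally held in
# originally held in ancient
# held in ancient Greece, were
# Greece, were revived in 1896
# revived in 1896 and
# have since become the world’s
# become the world’s foremost
# world’s foremost sports
# foremost sports competition,
# sports competition, bringing
# bringing together
# athletes from around the
# from around the globe.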

The Olympic Games, originally
Games, originally held in
originally held in ancient
held in ancient Greece, were
Greece, were revived in 1896
revived in 1896 and
have since become the world’s
become the world’s foremost
world’s foremost sports
foremost sports competition,
sports competition, bringing
bringing together
athletes from around the
from around the globe.
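
The idea behind recursive splitting can be illustrated without the library. The sketch below is a simplified stand-in, not LangChain's implementation: the function name `recursive_split` and the default separator order are assumptions, and overlap is left out for brevity. It tries the coarsest separator first and falls back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue  # skip empty pieces produced by repeated separators
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "The Olympic Games, originally held in ancient Greece, were revived in 1896"
pieces = recursive_split(sample, chunk_size=30)
for piece in pieces:
    print(piece)
```

Because words are merged back together until the size limit is reached, chunks end on word boundaries whenever a space is available.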


### **3. Document-Specific Splitting**

#### Pros:
- **Relevance**: Different document types are split using the most appropriate method, preserving their logical structure.
- **Precision**: Tailors the chunking process to the unique characteristics of each document type.
#### Cons:
- **Complex Implementation**: Requires different chunking strategies and libraries for various document types.
- **Maintenance**: Every supported document type adds a splitter that must be kept up to date, so upkeep grows with the diversity of methods.

#### **3.1 Markdown Splitting**

In [5]:
from langchain_text_splitters import MarkdownTextSplitter
# Sample Markdown text
markdown_text = """
# Fun in California
## Driving
Try driving on the 1 down to San Diego
### Food
Make sure to eat a burrito while you're there
## Hiking
Go to Yosemite
"""
# Initialize the Markdown text splitter
splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)
# Create documents using the text splitter
documents = splitter.create_documents([markdown_text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
# Output:
# # Fun in California\n## Driving
# Try driving on the 1 down to San Diego
# ### Food
# Make sure to eat a burrito while you're
# there
# ## Hiking\nGo to Yosemite

# Fun in California
## Driving
Try driving on the 1 down to San Diego
### Food
Make sure to eat a burrito while you're
there
## Hiking
Go to Yosemite
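
Size-based Markdown splitting still breaks the burrito sentence in two. An alternative that keeps each section whole is to split at headings and carry the heading path along as metadata. The following is a minimal sketch of that idea using only the standard library; the helper name `split_markdown_by_heading` and the output shape are assumptions, and `#` lines inside fenced code blocks are not handled.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown_text):
    """Return one section per heading, each with its heading path and body text."""
    sections = []
    path = []  # stack of (level, title) for the headings currently in scope
    body = []  # lines collected under the current heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"headings": [title for _, title in path], "text": text})

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            body.clear()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before entering the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


markdown_text = """
# Fun in California
## Driving
Try driving on the 1 down to San Diego
### Food
Make sure to eat a burrito while you're there
## Hiking
Go to Yosemite
"""
sections = split_markdown_by_heading(markdown_text)
for section in sections:
    print(" > ".join(section["headings"]), "|", section["text"])
```

Sections produced this way can still exceed the embedding model's limit, so in practice a size-based splitter may be applied to each section afterwards.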


#### **3.2 Python Code Splitting**

In [6]:
from langchain_text_splitters import PythonCodeTextSplitter
# Sample Python code
python_text = """
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p1 = Person("John", 36)
for i in range(10):
    print(i)
"""
# Initialize the Python code text splitter
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
# Create documents using the text splitter
documents = python_splitter.create_documents([python_text])
# Display the created documents
for doc in documents:
    print(doc.page_content)
# Output:
# class Person:\n    def __init__(self, name, age):\n        self.name = name\n        self.age = age
# p1 = Person("John", 36)\nfor i in range(10):\n    print(i)

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p1 = Person("John", 36)
for i in range(10):
    print(i)
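
For Python specifically, the language's own parser can find the split points. The sketch below swaps in a different technique from the splitter above: it uses the standard-library `ast` module to return one chunk per top-level statement, with no size limit. The helper name `split_python_top_level` is hypothetical, and the source must be syntactically valid for `ast.parse` to succeed.

```python
import ast


def split_python_top_level(source):
    """Return the source text of each top-level statement (class, function, assignment, loop, ...)."""
    tree = ast.parse(source)
    # get_source_segment slices the original text using the node's line/column span.
    return [ast.get_source_segment(source, node) for node in tree.body]


python_text = """
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
p1 = Person("John", 36)
for i in range(10):
    print(i)
"""
code_chunks = split_python_top_level(python_text)
for code_chunk in code_chunks:
    print(code_chunk)
    print("---")
```

Comments that sit between statements are not part of any node and are dropped by this approach, which is one reason it is only a sketch.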


### **4. Semantic Splitting**

#### Pros:
- **Contextual Relevance**: Ensures that chunks contain semantically similar content, enhancing the accuracy of information retrieval and generation.
- **Dynamic Adaptability**: Can adapt to various text structures and content types based on meaning rather than rigid rules.
#### Cons:
- **Computational Overhead**: Requires additional computational resources to generate and compare embeddings.
- **Complexity**: More complex to implement compared to simpler splitting methods.

> Reference: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
from langchain_openai import OpenAIEmbeddings
import re
# Sample text
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.
Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.
It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer.
"""
# Splitting the text into sentences
sentences = re.split(r'(?<=[.?!])\s+', text)
sentences = [{'sentence': x, 'index' : i} for i, x in enumerate(sentences)]
# Combine sentences for context
def combine_sentences(sentences, buffer_size=1):
    for i in range(len(sentences)):
        combined_sentence = ''
        for j in range(i - buffer_size, i):
            if j >= 0:
                combined_sentence += sentences[j]['sentence'] + ' '
        combined_sentence += sentences[i]['sentence']
        for j in range(i + 1, i + 1 + buffer_size):
            if j < len(sentences):
                combined_sentence += ' ' + sentences[j]['sentence']
        sentences[i]['combined_sentence'] = combined_sentence
    return sentences
sentences = combine_sentences(sentences)
# Generate embeddings
oai_embeds = OpenAIEmbeddings()
embeddings = oai_embeds.embed_documents([x['combined_sentence'] for x in sentences])
# Add embeddings to sentences
for i, sentence in enumerate(sentences):
    sentence['combined_sentence_embedding'] = embeddings[i]
# Calculate cosine distances
def calculate_cosine_distances(sentences):
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]['combined_sentence_embedding']
        embedding_next = sentences[i + 1]['combined_sentence_embedding']
        similarity = cosine_similarity([embedding_current], [embedding_next])[0][0]
        distance = 1 - similarity
        distances.append(distance)
        sentences[i]['distance_to_next'] = distance
    return distances, sentences
distances, sentences = calculate_cosine_distances(sentences)
# Determine breakpoints and create chunks
import numpy as np
breakpoint_distance_threshold = np.percentile(distances, 95)
indices_above_thresh = [i for i, x in enumerate(distances) if x > breakpoint_distance_threshold]
# Combine sentences into chunks
chunks = []
start_index = 0
for index in indices_above_thresh:
    end_index = index
    group = sentences[start_index:end_index + 1]
    combined_text = ' '.join([d['sentence'] for d in group])
    chunks.append(combined_text)
    start_index = index + 1
if start_index < len(sentences):
    combined_text = ' '.join([d['sentence'] for d in sentences[start_index:]])
    chunks.append(combined_text)
# Display the created chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk #{i+1}:\n{chunk}\n")



Chunk #1:

One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear. Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers.

Chunk #2:
You get no customers, and you go out of business. It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. 
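
The breakpoint logic can be tried without an API key. The following is a minimal sketch that replaces the OpenAI embeddings with hand-written two-dimensional toy vectors and the 95th-percentile threshold with a fixed value; the sentences, the vectors, the threshold of 0.5, and the helper names are all illustrative assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def split_at_breakpoints(sentences, embeddings, threshold):
    """Start a new chunk wherever consecutive sentences are farther apart than threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


toy_sentences = ["Cats purr.", "Cats nap a lot.", "Stocks fell today.", "Bond yields rose."]
toy_embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]  # made-up vectors
toy_chunks = split_at_breakpoints(toy_sentences, toy_embeddings, threshold=0.5)
for i, chunk in enumerate(toy_chunks):
    print(f"Chunk #{i + 1}: {chunk}")
```

A percentile-based threshold, as used in the cell above, adapts to each document, whereas a fixed threshold like this one has to be tuned for the embedding model.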



### **5. Agentic Splitting**

#### Pros:
- **High Precision**: Provides highly relevant and contextually accurate chunks by using sophisticated language models.
- **Adaptability**: Can handle diverse types of text and adjust chunking strategies on the fly.
#### Cons:
- **Resource Intensive and Additional LLM Cost**: Every chunking decision requires a call to a large language model, which adds significant compute and cost.
- **Complex Implementation**: Involves setting up and fine-tuning language models for optimal performance.

In [8]:
# !pip install langgraph

In [1]:
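# Sketch of the intended pipeline only: these node classes could not be imported
# from langgraph (see the ImportError after this cell), so the cell stays commented out.
# from langgraph.nodes import InputNode, SentenceSplitterNode, LLMDecisionNode, ChunkingNode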

# # Step 1: Input Node
# input_node = InputNode(name="Document Input")

# # Step 2: Sentence Splitting Node
# splitter_node = SentenceSplitterNode(input=input_node.output, name="Sentence Splitter")

# # Step 3: LLM Decision Node
# decision_node = LLMDecisionNode(
#     input=splitter_node.output, 
#     prompt_template="Does the sentence '{next_sentence}' belong to the same chunk as '{current_chunk}'?", 
#     name="LLM Decision"
# )

# # Step 4: Chunking Node
# chunking_node = ChunkingNode(input=decision_node.output, name="Semantic Chunking")

# # Run the graph
# document = "Your document text here..."
# result = chunking_node.run(document=document)
# print(result)

ImportError: cannot import name 'InputNode' from 'langgraph.prebuilt' (/root/anaconda3/envs/loki-311/lib/python3.11/site-packages/langgraph/prebuilt/__init__.py)

In [2]:
from langchain import hub
from pydantic import BaseModel
from typing import List
from langchain_openai import ChatOpenAI

In [4]:
# Pull the object from the hub
obj = hub.pull("wfh/proposal-indexing")

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o")

# A Pydantic model to extract sentences from the passage
class Sentences(BaseModel):
    sentences: List[str]

# Create the sentence extraction function
def extract_sentences(text):
    # Get the response from the LLM
    response = llm.invoke(text)
    
    # Extract the content of the AIMessage object
    response_text = response.content
    
    # Split the response text into sentences
    sentences = response_text.split('. ')
    
    # Create and return the structured output using the Pydantic model
    return Sentences(sentences=sentences)

# Test it out
text = """
On July 20, 1969, astronaut Neil Armstrong walked on the moon. 
He was leading NASA's Apollo 11 mission. 
Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.
"""

# Process and validate the extracted sentences
sentences = extract_sentences(text)

# Print the extracted sentences
print(sentences)



sentences=['On July 20, 1969, astronaut Neil Armstrong made history by becoming the first human to walk on the moon', 'He was the commander of NASA\'s Apollo 11 mission, which was a monumental achievement in space exploration and a significant milestone in the Space Race between the United States and the Soviet Union.\n\nAs Armstrong descended from the lunar module, named Eagle, and set foot on the moon\'s surface, he uttered the now-iconic words, "That\'s one small step for man, one giant leap for mankind." This statement encapsulated the profound significance of the event, emphasizing both the individual achievement and its broader implications for humanity.\n\nThe Apollo 11 mission not only demonstrated the technological prowess and determination of NASA but also served as a unifying moment that captured the imagination and aspirations of people around the world', 'Alongside Armstrong, astronaut Edwin "Buzz" Aldrin also walked on the moon, while Michael Collins piloted the command m

In [4]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

chunks = {}

def create_new_chunk(chunk_id, proposition):
    summary_llm = llm.with_structured_output(ChunkMeta)

    summary_prompt_template = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "Generate a new summary and a title based on the propositions.",
            ),
            (
                "user",
                "propositions:{propositions}",
            ),
        ]
    )

    summary_chain = summary_prompt_template | summary_llm

    chunk_meta = summary_chain.invoke(
        {
            "propositions": [proposition],
        }
    )

    chunks[chunk_id] = {
        "summary": chunk_meta.summary,
        "title": chunk_meta.title,
        "propositions": [proposition],
    }

In [5]:
from langchain_core.pydantic_v1 import BaseModel, Field

class ChunkMeta(BaseModel):
    title: str = Field(description="The title of the chunk.")
    summary: str = Field(description="The summary of the chunk.")

def add_proposition(chunk_id, proposition):
    summary_llm = llm.with_structured_output(ChunkMeta)

    summary_prompt_template = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "If the current_summary and title are still valid for the propositions, return them. "
                "If not, generate a new summary and a title based on the propositions.",
            ),
            (
                "user",
                "current_summary:{current_summary}\n\ncurrent_title:{current_title}\n\npropositions:{propositions}",
            ),
        ]
    )

    summary_chain = summary_prompt_template | summary_llm

    chunk = chunks[chunk_id]

    current_summary = chunk["summary"]
    current_title = chunk["title"]
    current_propositions = chunk["propositions"]

    all_propositions = current_propositions + [proposition]

    chunk_meta = summary_chain.invoke(
        {
            "current_summary": current_summary,
            "current_title": current_title,
            "propositions": all_propositions,
        }
    )

    chunk["summary"] = chunk_meta.summary
    chunk["title"] = chunk_meta.title
    chunk["propositions"] = all_propositions

In [6]:
def find_chunk_and_push_proposition(proposition):

    class ChunkID(BaseModel):
        chunk_id: int = Field(description="The chunk id.")

    allocation_llm = llm.with_structured_output(ChunkID)

    allocation_prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You have the chunk ids and the summaries. "
                "Find the chunk that best matches the proposition. "
                "If no chunk matches, return a new chunk id. "
                "Return only the chunk id.",
            ),
            (
                "user",
                "proposition:{proposition}\n\nchunks_summaries:{chunks_summaries}",
            ),
        ]
    )

    allocation_chain = allocation_prompt | allocation_llm

    chunks_summaries = {
        chunk_id: chunk["summary"] for chunk_id, chunk in chunks.items()
    }

    best_chunk_id = allocation_chain.invoke(
        {"proposition": proposition, "chunks_summaries": chunks_summaries}
    ).chunk_id

    if best_chunk_id not in chunks:
        # create_new_chunk stores the new chunk in `chunks` and returns nothing
        create_new_chunk(best_chunk_id, proposition)
        return

    add_proposition(best_chunk_id, proposition)
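
The control flow of the three functions above (find the best chunk, create one if nothing matches, otherwise append) can be exercised offline. The sketch below is a stand-in, not the LLM-based method: the allocation step is replaced by a hypothetical word-overlap heuristic, titles and summaries are omitted, and the sample propositions and the `min_shared` cutoff are made up for illustration.

```python
toy_chunks = {}  # chunk_id -> {"propositions": [...]}


def words(text):
    """Lower-cased words with surrounding punctuation removed."""
    return {w.strip(".,'\"").lower() for w in text.split()}


def best_matching_chunk(proposition, min_shared=2):
    """Stand-in for the LLM allocation call: pick the chunk sharing the most words."""
    best_id, best_score = None, 0
    for chunk_id, chunk in toy_chunks.items():
        score = len(words(proposition) & words(" ".join(chunk["propositions"])))
        if score > best_score:
            best_id, best_score = chunk_id, score
    return best_id if best_score >= min_shared else None


def push_proposition(proposition):
    chunk_id = best_matching_chunk(proposition)
    if chunk_id is None:
        # No chunk matches well enough: open a new one.
        chunk_id = len(toy_chunks)
        toy_chunks[chunk_id] = {"propositions": []}
    toy_chunks[chunk_id]["propositions"].append(proposition)


for proposition in [
    "Neil Armstrong walked on the moon.",
    "Neil Armstrong was an astronaut.",
    "Paris is the capital of France.",
    "France is in Europe.",
]:
    push_proposition(proposition)

for chunk_id, chunk in toy_chunks.items():
    print(chunk_id, chunk["propositions"])
```

With the real functions, the same loop would call `find_chunk_and_push_proposition(proposition)` for each proposition, at the cost of at least two LLM calls per proposition.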

> Reference: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/d2da20552446179779c74ccc9e232e77ba981659/5_Levels_Of_Text_Splitting.ipynb

In [2]:
import os
from typing import List

from langchain import hub
from langchain.chains import create_extraction_chain_pydantic
from langchain_core.pydantic_v1 import BaseModel
from langchain_openai import ChatOpenAI

In [3]:
obj = hub.pull("wfh/proposal-indexing")
llm = ChatOpenAI(model='gpt-4o', openai_api_key = os.getenv("OPENAI_API_KEY"))



In [5]:
# use it in a runnable
runnable = obj | llm

In [7]:
# Pydantic data class
class Sentences(BaseModel):
    sentences: List[str]
    
# Extraction
extraction_chain = create_extraction_chain_pydantic(pydantic_schema=Sentences, llm=llm)



In [9]:
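def get_propositions(text):
    # Decompose the text into propositions with the hub prompt
    runnable_output = runnable.invoke({"input": text}).content

    # Parse the LLM output into a list of sentences
    propositions = extraction_chain.run(runnable_output)[0].sentences
    return propositions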

In [11]:
essay =  """
On July 20, 1969, astronaut Neil Armstrong walked on the moon. 
He was leading NASA's Apollo 11 mission. 
Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.
"""

In [14]:
paragraphs = essay.split("\n")

In [16]:

essay_propositions = []

for i, para in enumerate(paragraphs):
    propositions = get_propositions(para)
    
    essay_propositions.extend(propositions)
    print(f"Done with {i}")



Done with 0
Done with 1
Done with 2
Done with 3
Done with 4


In [17]:

print(f"You have {len(essay_propositions)} propositions")
essay_propositions

You have 7 propositions


['Sure, please provide the content you would like me to decompose.',
 'On July 20, 1969, Neil Armstrong walked on the moon.',
 'Neil Armstrong was an astronaut.',
 "He was leading NASA's Apollo 11 mission.",
 'Neil Armstrong famously said a quote when he stepped onto the lunar surface.',
 "Neil Armstrong said, 'That's one small step for man, one giant leap for mankind.'",
 'Sure, please provide the content you would like to have decomposed.']
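
The first and last propositions above appear to be the model replying to an empty prompt: `essay.split("\n")` yields five entries (hence "Done with 0" to "Done with 4"), and the first and last are empty strings because the essay starts and ends with a newline. A minimal sketch of one way to avoid this, assuming blank lines carry nothing worth sending to the LLM:

```python
essay = """
On July 20, 1969, astronaut Neil Armstrong walked on the moon. 
He was leading NASA's Apollo 11 mission. 
Armstrong famously said, "That's one small step for man, one giant leap for mankind" as he stepped onto the lunar surface.
"""

raw_paragraphs = essay.split("\n")
# Keep only non-blank lines and trim the trailing spaces.
paragraphs = [line.strip() for line in raw_paragraphs if line.strip()]

print(len(raw_paragraphs), len(paragraphs))
```

Filtering before the loop also saves two LLM calls and keeps the unrelated "Content Analysis Requests" chunk from being created later.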

In [18]:
# mini script I made
from rag.medium.agentic_chunker import AgenticChunker

In [19]:
ac = AgenticChunker()

In [20]:
ac.add_propositions(essay_propositions)


Adding: 'Sure, please provide the content you would like me to decompose.'
No chunks, creating a new one
Created new chunk (1921f): Content Analysis Requests

Adding: 'On July 20, 1969, Neil Armstrong walked on the moon.'
No chunks found
Created new chunk (56ed1): Space Exploration History

Adding: 'Neil Armstrong was an astronaut.'
Chunk Found (56ed1), adding to: Space Exploration History

Adding: 'He was leading NASA's Apollo 11 mission.'
Chunk Found (56ed1), adding to: Neil Armstrong & Moon Landing

Adding: 'Neil Armstrong famously said a quote when he stepped onto the lunar surface.'
Chunk Found (56ed1), adding to: Neil Armstrong's Astronaut Career & Historic Achievements

Adding: 'Neil Armstrong said, 'That's one small step for man, one giant leap for mankind.''
Chunk Found (56ed1), adding to: Neil Armstrong & Apollo 11 Moon Landing

Adding: 'Sure, please provide the content you would like to have decomposed.'
Chunk Found (1921f), adding to: Content Analysis Requests


In [21]:
ac.pretty_print_chunks()


You have 2 chunks

Chunk #0
Chunk ID: 1921f
Summary: This chunk contains conversations about requests for content decomposition or analysis.
Propositions:
    -Sure, please provide the content you would like me to decompose.
    -Sure, please provide the content you would like to have decomposed.



Chunk #1
Chunk ID: 56ed1
Summary: This chunk contains information about Neil Armstrong's life, his career with NASA, details of the Apollo 11 mission, and his iconic quote from the moon landing.
Propositions:
    -On July 20, 1969, Neil Armstrong walked on the moon.
    -Neil Armstrong was an astronaut.
    -He was leading NASA's Apollo 11 mission.
    -Neil Armstrong famously said a quote when he stepped onto the lunar surface.
    -Neil Armstrong said, 'That's one small step for man, one giant leap for mankind.'





In [22]:
chunks = ac.get_chunks(get_type='list_of_strings')
chunks

['Sure, please provide the content you would like me to decompose. Sure, please provide the content you would like to have decomposed.',
 "On July 20, 1969, Neil Armstrong walked on the moon. Neil Armstrong was an astronaut. He was leading NASA's Apollo 11 mission. Neil Armstrong famously said a quote when he stepped onto the lunar surface. Neil Armstrong said, 'That's one small step for man, one giant leap for mankind.'"]