# Chunking for RAG

In this notebook we study the different levels and types of chunking strategies that we can employ to improve the retrieval performance of a RAG application.

### Importing libraries

In [2]:
# importing libraries
import langchain
import llama_index

### Install new libraries

In [43]:
# pip install langchain
# pip install llama-index
# pip install "unstructured[all-docs]"


## Level 1: Character Splitting

 - **Chunk Size:** The number of characters you would like in your chunks, e.g. 50, 100, 1000.
 - **Chunk Overlap:** The number of characters by which consecutive chunks overlap. This helps avoid cutting a single piece of context into multiple pieces, at the cost of duplicating data across chunks.

In [10]:
text = "This is the text I would like to chunk up. It is example text for this exercise."

Let's split this example manually.

In [11]:
# create a list that will hold your chunks
chunks = []

CHUNK_SIZE = 35 # characters

# step through the text in strides of CHUNK_SIZE characters
for i in range(0, len(text), CHUNK_SIZE):
    chunk = text[i:i+CHUNK_SIZE]
    chunks.append(chunk)

# print the chunks that we generated from the text
chunks

['This is the text I would like to ch',
 'unk up. It is example text for this',
 ' exercise.']

In [3]:
from langchain.text_splitter import CharacterTextSplitter

In [22]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap = 0, separator = '', strip_whitespace = False)

Using CharacterTextSplitter, we can now call `create_documents` to split the text. `create_documents` expects a list of texts, so we have to wrap our string in [].

In [23]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is the text I would like to ch'),
 Document(metadata={}, page_content='unk up. It is example text for this'),
 Document(metadata={}, page_content=' exercise.')]

In LangChain, each chunk is held in a Document, an object that holds metadata and page_content as shown above.

#### Chunk Overlap

In [24]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap = 5, separator = '')

In [25]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is the text I would like to ch'),
 Document(metadata={}, page_content='to chunk up. It is example text for'),
 Document(metadata={}, page_content='t for this exercise.')]
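The overlap can also be reproduced by hand, in the same spirit as the manual split earlier. This is a minimal sketch in plain Python, not LangChain's implementation; the helper name `chunk_with_overlap` is made up for illustration. The window advances by `chunk_size - chunk_overlap` characters each time, so the last 5 characters of one chunk reappear at the start of the next.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Fixed-size character chunks where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


text = "This is the text I would like to chunk up. It is example text for this exercise."
chunks = chunk_with_overlap(text, chunk_size=35, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

For this example text the three chunks are the same strings as the `page_content` values shown above.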

#### Separator

In [40]:
text_splitter = CharacterTextSplitter(chunk_size = 45, chunk_overlap = 0, separator = 'ch')

In [39]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='This is the text I would like to'),
 Document(metadata={}, page_content='unk up. It is example text for this exercise.')]

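Note that the separator itself ('ch') is dropped from the chunks. chunk_overlap and separator don't combine well here: the overlap is built from whole pieces produced by the separator, and with a coarse separator like 'ch' each piece is far larger than a small overlap, so no overlap appears in the output.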
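A rough way to picture what the separator does is plain `str.split`. This is only an illustration of the idea, not LangChain's code: the text is cut wherever 'ch' occurs, the separator is dropped, and surrounding whitespace is stripped.

```python
text = "This is the text I would like to chunk up. It is example text for this exercise."

# cut on the separator, then strip whitespace from each piece
pieces = [piece.strip() for piece in text.split("ch")]

for piece in pieces:
    print(len(piece), repr(piece))
```

Both pieces fit within the chunk_size of 45 used above, but they are too long to be merged into a single chunk, so each piece becomes its own chunk.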

#### Using Llama Index

Llama Index is also a great choice for chunking and indexing. It provides node relationships out of the box, which can aid retrieval later.

In [4]:
from llama_index.core.text_splitter import SentenceSplitter
from llama_index.core import SimpleDirectoryReader

In [51]:
splitter = SentenceSplitter(
    chunk_size = 200,
    chunk_overlap = 15
)

In [52]:
# let's load files using SimpleDirectoryReader
documents = SimpleDirectoryReader(
    input_files = ["./data/mit.txt"]
).load_data()

In [61]:
documents

[Document(id_='4480bb4a-2992-4f43-93c1-7152ecf6ff60', embedding=None, metadata={'file_path': 'data\\mit.txt', 'file_name': 'mit.txt', 'file_type': 'text/plain', 'file_size': 36045, 'creation_date': '2025-03-18', 'last_modified_date': '2025-03-18'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='\n\nWant to start a startup?  Get funded by\nY Combinator.\n\n\n\n\nOctober 2006(This essay is derived from a talk at MIT.)\nTill recently graduating seniors had two choices: get a job or go\nto grad school.  I think there will increasingly be a third option:\nto start your own startup.  But how common will that be?I\'m sure the default will always

We can see that SimpleDirectoryReader adds extra information when loading the data, which can be useful after the document has been chunked.

In Llama Index, a chunk of data is called a node. Nodes are similar to documents, but with more relationship data attached to them.

In [62]:
# splitting the data using llama_index SentenceSplitter
nodes = splitter.get_nodes_from_documents(documents)

In [63]:
# viewing the single chunk
print(nodes[0])

# seeing other metadata that is loaded in the node
nodes[0]

Node ID: 76ef7900-f1b9-4d8c-a44a-a1955c98caca
Text: Want to start a startup?  Get funded by Y Combinator.
October 2006(This essay is derived from a talk at MIT.) Till recently
graduating seniors had two choices: get a job or go to grad school.  I
think there will increasingly be a third option: to start your own
startup.  But how common will that be?I'm sure the default will always
be to get ...


TextNode(id_='76ef7900-f1b9-4d8c-a44a-a1955c98caca', embedding=None, metadata={'file_path': 'data\\mit.txt', 'file_name': 'mit.txt', 'file_type': 'text/plain', 'file_size': 36045, 'creation_date': '2025-03-18', 'last_modified_date': '2025-03-18'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='4480bb4a-2992-4f43-93c1-7152ecf6ff60', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'data\\mit.txt', 'file_name': 'mit.txt', 'file_type': 'text/plain', 'file_size': 36045, 'creation_date': '2025-03-18', 'last_modified_date': '2025-03-18'}, hash='9d9100a4b75ae4d5bae2183afa516771e67968deec164e25896e466d37f66865'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='32542a12-12e7-4daf-8733-182b9ec5986

We can see that various metadata and relationship data have been assigned to the chunk, such as
 - id
 - embedding
 - metadata
 - excluded_embed_metadata_keys
 - excluded_llm_metadata_keys
 - relationships
 - and so on.
 

This extra information can help the retrieval step later, for example by following the relationships from one chunk to its neighbours.
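To make the value of relationship data more concrete, here is a toy model in plain Python. It is not Llama Index's `TextNode`; the `ToyNode` class and the helpers `link_nodes` and `with_neighbors` are hypothetical names used only for illustration. Each chunk remembers the id of the previous and next chunk, so a retrieved chunk can be expanded with its neighbours.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyNode:
    """Toy stand-in for a chunk with metadata and prev/next links (hypothetical)."""
    text: str
    metadata: dict = field(default_factory=dict)
    id_: str = field(default_factory=lambda: str(uuid.uuid4()))
    prev_id: Optional[str] = None
    next_id: Optional[str] = None


def link_nodes(texts, metadata):
    """Create one node per text and link each node to its neighbours."""
    nodes = [ToyNode(text=t, metadata=dict(metadata)) for t in texts]
    for left, right in zip(nodes, nodes[1:]):
        left.next_id = right.id_
        right.prev_id = left.id_
    return nodes


def with_neighbors(nodes, index):
    """Return the text of nodes[index] together with its previous and next chunk."""
    by_id = {node.id_: node for node in nodes}
    hit = nodes[index]
    parts = []
    if hit.prev_id:
        parts.append(by_id[hit.prev_id].text)
    parts.append(hit.text)
    if hit.next_id:
        parts.append(by_id[hit.next_id].text)
    return " ".join(parts)


toy_nodes = link_nodes(["First chunk.", "Second chunk.", "Third chunk."], {"file_name": "mit.txt"})
print(with_neighbors(toy_nodes, 1))
```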

## Level 2: Recursive Character Text Splitting

The problem with Level #1 is that we don't take the structure of our document into account at all. We simply split by a fixed number of characters.

The Recursive Character Text Splitter helps with this. With it, we specify a series of separators that are used to split our docs.

You can see the default separators for LangChain [here](https://github.com/langchain-ai/langchain/blob/9ef2feb6747f5a69d186bd623b569ad722829a5e/libs/langchain/langchain/text_splitter.py#L842). Let's take a look at them one by one.

 - "\n\n" - Double new line, or most commonly paragraph breaks
 - "\n" - New lines
 - " " - Spaces
 - "" - Characters
 
I'm not sure why a period (".") isn't included on the list, perhaps it is not universal enough? If you know, let me know.

This is the swiss army knife of splitters and my first choice when mocking up a quick application. If you don't know which splitter to start with, this is a good first bet.
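The idea behind the separator list can be sketched in a few lines of plain Python. This is a simplified illustration and not LangChain's implementation: it has no overlap, and the function name `recursive_split` is made up here. The text is split on the first separator, small pieces are merged back together up to `chunk_size`, and any piece that is still too large is split again with the next separator in the list.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting: try coarse separators first, finer ones as needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # the piece alone is too large: fall back to the finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph here.\n\nSecond one is a bit longer than the first.\n\nThird."
demo_chunks = recursive_split(sample, chunk_size=30)
for chunk in demo_chunks:
    print(repr(chunk))
```

In this sample the first and third paragraphs fit in one chunk each, while the second paragraph is too long and is broken on spaces instead.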

Let's try it out

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [10]:
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

In [97]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap = 10)

In [98]:
# splitting the text
chunks = text_splitter.create_documents([text])

# viewing the chunk
chunks

[Document(metadata={}, page_content="One of the most important things I didn't understand about the"),
 Document(metadata={}, page_content='about the world when I was a child is the degree to which the'),
 Document(metadata={}, page_content='which the returns for performance are superlinear.'),
 Document(metadata={}, page_content='Teachers and coaches implicitly told us the returns were linear.'),
 Document(metadata={}, page_content='linear. "You get out," I heard a thousand times, "what you put'),
 Document(metadata={}, page_content='you put in." They meant well, but this is rarely true. If your'),
 Document(metadata={}, page_content="If your product is only half as good as your competitor's, you"),
 Document(metadata={}, page_content="you don't get half as many customers. You get no customers, and"),
 Document(metadata={}, page_content='and you go out of business.'),
 Document(metadata={}, page_content="It's obviously true that the returns for performance are"),
 Document(metadata={}

We can see that the chunks now contain complete words, but sentences are still being split, and that's not good. We are losing context because the splits fall in the middle of sentences.

Many chunks end with ".", usually because that is the end of a paragraph, where RecursiveCharacterTextSplitter split on the "\n\n" separator first.

When a piece is still larger than chunk_size, the splitter falls back to the next separator, i.e. "\n", and so on through the separator list mentioned above.

To avoid splitting in the middle of sentences, we can increase chunk_size to a larger number. Let's see.

In [8]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap = 10)

In [11]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
 Document(metadata={}, page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
 Document(metadata={}, page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, th

We can see that most of the chunks now end with ".", which means the splits fall at the end of a sentence rather than in the middle of one.
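A quick way to compare chunk sizes is to measure how many chunks end with a period. The helper below is a small sketch with a made-up name, `share_ending_with_period`; it works on plain strings, so with LangChain documents you would pass something like `[doc.page_content for doc in chunks]`. The sample strings are the first three chunks from the chunk_size = 65 run above.

```python
def share_ending_with_period(chunks):
    """Fraction of chunks whose text ends with a period."""
    if not chunks:
        return 0.0
    ended = sum(1 for chunk in chunks if chunk.rstrip().endswith("."))
    return ended / len(chunks)


small_chunks = [
    "One of the most important things I didn't understand about the",
    "about the world when I was a child is the degree to which the",
    "which the returns for performance are superlinear.",
]
share = share_ending_with_period(small_chunks)
print(f"{share:.2f}")
```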

## Level 3: Document Specific Splitting

Here we look at chunking documents that contain not only text, but also code snippets, tables, images and more.

Let's start with the Markdown, Python and JS splitters.

We should also look at the document splitters in LangChain and Llama Index.

#### Markdown

LangChain has a different set of separators for Markdown, such as headers, code blocks, horizontal lines, double new lines and so on.

Let's try

In [14]:
# import MarkdownTextSplitter from langchain
from langchain.text_splitter import MarkdownTextSplitter

In [32]:
# initiate the markdown splitter
splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap = 0)

In [33]:
# define the markdown text for using as example

markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [34]:
# split the markdown text
splitter.create_documents([markdown_text])

[Document(metadata={}, page_content='# Fun in California\n\n## Driving'),
 Document(metadata={}, page_content='Try driving on the 1 down to San Diego'),
 Document(metadata={}, page_content='### Food'),
 Document(metadata={}, page_content="Make sure to eat a burrito while you're"),
 Document(metadata={}, page_content='there'),
 Document(metadata={}, page_content='## Hiking\n\nGo to Yosemite')]

We can see how the Markdown text is split into different chunks along headings and text.

It is still not ideal, though, as we get a chunk with just "there" in it.
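One way to avoid tiny fragments like "there" is to ignore the character budget and split purely on headings, keeping each heading together with its text. The sketch below is a plain-Python illustration of that idea and not what MarkdownTextSplitter does; the function name `split_by_headers` is hypothetical, and it does not handle `#` lines inside fenced code blocks.

```python
import re


def split_by_headers(markdown):
    """Split markdown into sections, one per heading, keeping the heading as metadata."""
    sections = []
    heading, body = None, []
    for line in markdown.strip().splitlines():
        if re.match(r"#{1,6}\s+\S", line):
            # a new heading closes the section collected so far
            if heading is not None or body:
                sections.append({"heading": heading, "text": "\n".join(body).strip()})
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading is not None or body:
        sections.append({"heading": heading, "text": "\n".join(body).strip()})
    return sections


markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

sections = split_by_headers(markdown_text)
for section in sections:
    print(section)
```

Here the sentence about the burrito stays in one piece under its "Food" heading.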

#### Python

We can do the same for Python, using a Python splitter instead of the Markdown splitter.

In [35]:
# import python code splitter
from langchain.text_splitter import PythonCodeTextSplitter

In [37]:
# initiate the python code splitter
python_splitter = PythonCodeTextSplitter(chunk_size = 100, chunk_overlap = 0)

In [38]:
# python code as example
python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

In [39]:
python_splitter.create_documents([python_text])

[Document(metadata={}, page_content='class Person:\n  def __init__(self, name, age):\n    self.name = name\n    self.age = age'),
 Document(metadata={}, page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n    print (i)')]

We may have to tune chunk_size to get better chunks for code snippets.
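An alternative to tuning chunk_size is to split on the code's own structure. The sketch below uses Python's standard `ast` module to return one chunk per top-level statement. It is a different technique from the separator-based PythonCodeTextSplitter used above, and the function name `split_top_level` is made up for this example. It only works on code that parses, and it drops comments and blank lines between statements.

```python
import ast


def split_top_level(source):
    """Return the source text of each top-level statement (class, function, loop, ...)."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]


python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

blocks = split_top_level(python_text)
for block in blocks:
    print(repr(block))
```

With this approach the class, the assignment and the loop each end up in their own chunk, regardless of their length.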

#### JavaScript

Similar to Python, there is a set of separators for JavaScript. We will use RecursiveCharacterTextSplitter and pass the language we are splitting, here JS.

In [42]:
# import recursive and language 
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

In [43]:
# example
javascript_text = """
// Function is called, the return value will end up in x
let x = myFunction(4, 3);

function myFunction(a, b) {
// Function returns the product of a and b
  return a * b;
}
"""

In [46]:
# initiate the recursive splitter
javascript_splitter = RecursiveCharacterTextSplitter.from_language(
    language = Language.JS, chunk_size = 65, chunk_overlap = 0
)

In [47]:
javascript_splitter.create_documents([javascript_text])

[Document(metadata={}, page_content='// Function is called, the return value will end up in x'),
 Document(metadata={}, page_content='let x = myFunction(4, 3);'),
 Document(metadata={}, page_content='function myFunction(a, b) {'),
 Document(metadata={}, page_content='// Function returns the product of a and b\n  return a * b;\n}')]

#### PDF w/ Tables

PDFs are one of the most common sources of data for language models. They often contain tables and images, such as financial data, studies, academic papers, etc.

Character-based splitting is not the best practice for PDFs; for a deep dive, there is a lecture from Lance Martin w/ LangChain.

We only look at the text-based method here, but tables can also be pulled using vision rather than text.

We will use Unstructured, a library dedicated to making data LLM-ready.

In [2]:
# import libraries
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json

In [3]:
# raw string so the backslashes in the Windows path are not treated as escape sequences
filename = r".\data\SalesforceFinancialReport.pdf"

# Extract the elements from the pdf
elements = partition_pdf(
    filename = filename,
    
    # unstructured helper
    strategy = "hi_res",
    infer_table_structure = True,
    model = 'yolox'
)

yolox_l0.05.onnx:   0%|          | 0.00/217M [00:00<?, ?B/s]

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
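The error above means that the poppler command-line tools, which provide the `pdfinfo` program, could not be found on the PATH. Before re-running the cell, a small check like the one below can confirm whether `pdfinfo` is visible to Python. This is a minimal sketch using only the standard library; the function name `poppler_available` is made up for illustration.

```python
import shutil


def poppler_available():
    """Return True if the poppler tool `pdfinfo` can be found on the PATH."""
    return shutil.which("pdfinfo") is not None


if poppler_available():
    print("pdfinfo found at:", shutil.which("pdfinfo"))
else:
    print("pdfinfo not found: install poppler and add its bin folder to PATH")
```

How poppler is installed depends on the platform, for example through a conda package, `poppler-utils` on Debian/Ubuntu, or a downloaded build on Windows whose `bin` folder is then added to PATH.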