# RAG from Scratch: Indexing

In [1]:
%pip install -q -U langchain langchain_core

Note: you may need to restart the kernel to use updated packages.


## Chunking

Chunking groups related information into the same chunk so that, when the retriever fetches data, it pulls the chunk that carries the most relevant information.

Each chunk is processed and embedded, and the AI system later searches those embeddings to find the most relevant sources.

Why is chunking important?
- LLMs and RAG pipelines are limited by context window size and computational constraints, so each chunk has to pack in as much relevant information as possible.
- Without proper chunking, we lose important contextual relationships and struggle to identify relevant information during retrieval.
- Effective chunking improves precision, because semantically coherent segments align with query patterns and user intent.



## Common Chunking Strategies (w/ examples):

> Example text:
>
>"The Journey of a River from its source in the mountains through forests, cities, and finally into the sea is a fascinating story of nature's cycle and human interaction with the environment."

In [2]:
text = "The Journey of a River from its source in the mountains through forests, cities, and finally into the sea is a fascinating story of nature's cycle and human interaction with the environment."

### 1. Fixed Size Chunking
This is the simplest and most computationally efficient method. It splits the text into chunks based on characters, words, or tokens, without considering meaning or structure.

Advantages:
+ Fast
+ Predictable
+ Easy to implement

Drawbacks:
- Ignores semantic structure (reduces retrieval accuracy)

In [3]:
from langchain_core.documents import Document

print("#### Manual Character Text Splitting ####")

# manual character chunking
chunks = []
chunk_size = 35

for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
    
documents = [Document(page_content=chunk, metadata={"source": "local"}) for chunk in chunks]
print(documents)

#### Manual Character Text Splitting ####
[Document(metadata={'source': 'local'}, page_content='The Journey of a River from its sou'), Document(metadata={'source': 'local'}, page_content='rce in the mountains through forest'), Document(metadata={'source': 'local'}, page_content='s, cities, and finally into the sea'), Document(metadata={'source': 'local'}, page_content=" is a fascinating story of nature's"), Document(metadata={'source': 'local'}, page_content=' cycle and human interaction with t'), Document(metadata={'source': 'local'}, page_content='he environment.')]


Result:

[ <br/>
    'The Journey of a River from its sou', <br/>
    'rce in the mountains through forest', <br/>
    's, cities, and finally into the sea', <br/>
    " is a fascinating story of nature's", <br/>
    ' cycle and human interaction with t', <br/>
    'he environment.' <br/>
]

In [4]:
# Automatic Text Splitting

print("#### Automatic Character Text Splitting ####")
from langchain_text_splitters import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=5, separator='', strip_whitespace=False)
documents = text_splitter.create_documents([text])
print(documents)

#### Automatic Character Text Splitting ####
[Document(metadata={}, page_content='The Journey of a River from its sou'), Document(metadata={}, page_content='s source in the mountains through f'), Document(metadata={}, page_content='ugh forests, cities, and finally in'), Document(metadata={}, page_content='ly into the sea is a fascinating st'), Document(metadata={}, page_content="ng story of nature's cycle and huma"), Document(metadata={}, page_content=' human interaction with the environ'), Document(metadata={}, page_content='vironment.')]


In [5]:
# print(len(result))
for i in documents:
    print(f'{i.page_content} | size = {len(i.page_content)}')

The Journey of a River from its sou | size = 35
s source in the mountains through f | size = 35
ugh forests, cities, and finally in | size = 35
ly into the sea is a fascinating st | size = 35
ng story of nature's cycle and huma | size = 35
 human interaction with the environ | size = 35
vironment. | size = 10


Without overlap, the result would be exactly the same as the manual split. Because we set an overlap of 5 characters, each chunk repeats the last 5 characters of the previous one, so the splitter produces more chunks (7 instead of 6) and each chunk carries a little more of the surrounding context.
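To make the effect of the overlap concrete, here is a minimal manual sketch. It assumes the overlap is implemented by advancing the start position by `chunk_size - overlap` characters; the helper name `chunk_with_overlap` is hypothetical and this is not how `CharacterTextSplitter` is implemented internally.

```python
text = ("The Journey of a River from its source in the mountains through forests, "
        "cities, and finally into the sea is a fascinating story of nature's cycle "
        "and human interaction with the environment.")


def chunk_with_overlap(text, chunk_size, overlap):
    """Fixed-size character chunks where each chunk repeats the last
    `overlap` characters of the previous one (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text, so we never
        # emit a trailing chunk that is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
    return chunks


overlap_chunks = chunk_with_overlap(text, chunk_size=35, overlap=5)
for chunk in overlap_chunks:
    print(f"{chunk} | size = {len(chunk)}")
```

For this text the sketch yields 7 chunks, with the last one holding only the 10 remaining characters.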

### 2. Recursive Character Text Splitting

This is a more dynamic approach than fixed-size chunking, and it pays more attention to structure. It splits the text according to a priority list of separators, which by default is `["\n\n", "\n", " ", ""]`. The splitter tries the separators one at a time: it splits on the first one, checks whether each piece is still larger than the requested chunk size and, if it is, splits that piece again using the next separator.
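The separator-priority idea can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `recursive_split` is hypothetical, overlap is ignored, and the greedy merging of small pieces is an assumption about how pieces are packed back together.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of recursive splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Out of separators, or reached the empty separator: hard cut by size.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        # Greedily merge neighbouring pieces while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: retry this piece with the next separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


sample = "Reiner is a student from National University of Singapore (NUS)."
sample_chunks = recursive_split(sample, chunk_size=20)
print(sample_chunks)
```

Because the sample contains no blank lines or newlines, the first two separators leave it in one piece, and the actual splitting happens on spaces.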

sample text from `content.txt`:

Reiner is a student from National University of Singapore (NUS). He is taking his Master of Computing with Artificial Intelligence Specialization there.\n\nHe is 25 years old currently and have been working on some personal projects to extend his expertise and knowledge.\n\nHis personal hobby is playing basketball and playing games.

In [8]:
print("#### Recursive Character Text Splitting ####")

from langchain_text_splitters import RecursiveCharacterTextSplitter
with open('example_docs/content.txt', 'r', encoding='utf-8') as file:
    text = file.read()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=0)  # default separators: ["\n\n", "\n", " ", ""]
print(text_splitter.create_documents([text])) 

#### Recursive Character Text Splitting ####
[Document(metadata={}, page_content='Reiner is a student'), Document(metadata={}, page_content='from National'), Document(metadata={}, page_content='University of'), Document(metadata={}, page_content='Singapore (NUS). He'), Document(metadata={}, page_content='is taking his'), Document(metadata={}, page_content='Master of Computing'), Document(metadata={}, page_content='with Artificial'), Document(metadata={}, page_content='Intelligence'), Document(metadata={}, page_content='Specialization'), Document(metadata={}, page_content='there.'), Document(metadata={}, page_content='He is 25 years old'), Document(metadata={}, page_content='currently and have'), Document(metadata={}, page_content='been working on'), Document(metadata={}, page_content='some personal'), Document(metadata={}, page_content='projects to extend'), Document(metadata={}, page_content='his expertise and'), Document(metadata={}, page_content='knowledge.'), Document(metadata={}, 

In [9]:
result = text_splitter.create_documents([text])
# print(len(result))
for i in result:
    print(f'{i.page_content} | size = {len(i.page_content)}')

Reiner is a student | size = 19
from National | size = 13
University of | size = 13
Singapore (NUS). He | size = 19
is taking his | size = 13
Master of Computing | size = 19
with Artificial | size = 15
Intelligence | size = 12
Specialization | size = 14
there. | size = 6
He is 25 years old | size = 18
currently and have | size = 18
been working on | size = 15
some personal | size = 13
projects to extend | size = 18
his expertise and | size = 17
knowledge. | size = 10
His personal hobby | size = 18
is playing | size = 10
basketball and | size = 14
playing games. | size = 14


In [12]:
# 3. Document Specific Splitting
print("#### Document Specific Splitting ####")

# Document Specific Splitting - Markdown
from langchain_text_splitters import MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""
result = splitter.create_documents([markdown_text])

#### Document Specific Splitting ####


In [13]:
# print(len(result))
for i in result:
    print(f'{i.page_content} | size = {len(i.page_content)}')

# Fun in California

## Driving | size = 31
Try driving on the 1 down to San Diego | size = 38
### Food | size = 8
Make sure to eat a burrito while you're | size = 39
there | size = 5
## Hiking

Go to Yosemite | size = 25


In [15]:
# Document Specific Splitting - Python
from langchain_text_splitters import PythonCodeTextSplitter
python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)
result = python_splitter.create_documents([python_text])

In [17]:
# print(len(result))
for i in result:
    print(f'{i.page_content} ---> size = {len(i.page_content)}')

class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age ---> size = 86
p1 = Person("John", 36)

for i in range(10):
    print (i) ---> size = 58


In [21]:
%pip install -q langchain-experimental langchain-ollama

Note: you may need to restart the kernel to use updated packages.


### Semantic Chunking

Semantic chunking focuses on the meaning of the text rather than its structure. It embeds the text, measures the semantic similarity between neighbouring segments, and splits where the topic shifts.

Advantage:
+ Better precision: semantic chunking produces chunks that align closely with user intent during retrieval.

Drawbacks:
- Computational cost: every segment has to be embedded, so this method is best suited to cases where accuracy matters more than speed (for example, domain-specific RAG systems for legal or medical domains).
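The percentile-based breakpoint idea can be sketched without any embedding model. In the sketch below, `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the helper names and the example sentences are assumptions made for illustration; it is not the `SemanticChunker` implementation.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    # Hypothetical stand-in for a real embedding model: word counts.
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, pct):
    # Percentile with linear interpolation between the two nearest ranks.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, pct=95):
    """Split wherever the distance between consecutive sentences
    exceeds the given percentile of all consecutive distances."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    vectors = [toy_embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


example_sentences = [
    "The river starts in the mountains.",
    "The river flows through the mountains and forests.",
    "Python is a programming language.",
    "Python is a popular programming language.",
]
semantic_result = semantic_chunks(example_sentences)
print(semantic_result)
```

With a real embedding model the distances are far less clear-cut than in this toy example, which is why the threshold type and percentile value usually need tuning.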

In [None]:
# 4. Semantic Chunking
print("#### Semantic Chunking ####")

from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama.embeddings import OllamaEmbeddings

# Percentile: the differences between consecutive sentences are calculated, and the text is split wherever a difference exceeds the X-th percentile
text_splitter = SemanticChunker(
    OllamaEmbeddings(model="mxbai-embed-large"),
    breakpoint_threshold_type="percentile",  # alternatives: "standard_deviation", "interquartile"
)
documents = text_splitter.create_documents([text])

#### Semantic Chunking ####
[Document(metadata={}, page_content='Reiner is a student from National University of Singapore (NUS). He is taking his Master of Computing with Artificial Intelligence Specialization there.'), Document(metadata={}, page_content='He is 25 years old currently and have been working on some personal projects to extend his expertise and knowledge. His personal hobby is playing basketball and playing games.')]


In [22]:
# print(len(result))
for i in documents:
    print(f'{i.page_content} ---> size = {len(i.page_content)}')

Reiner is a student from National University of Singapore (NUS). He is taking his Master of Computing with Artificial Intelligence Specialization there. ---> size = 152
He is 25 years old currently and have been working on some personal projects to extend his expertise and knowledge. His personal hobby is playing basketball and playing games. ---> size = 175


### Sliding Window Chunking

Where semantic chunking splits on meaning, sliding window chunking takes the opposite approach: it preserves the continuity of information by letting consecutive chunks overlap. This can improve relevance, because a passage that straddles a chunk boundary can still be retrieved from either side. The trade-off is redundancy, and repeated text increases the cost of storage and processing.

This can be implemented by adjusting the chunk overlap value of the `RecursiveCharacterTextSplitter`. A typical setting is 20–50% overlap between chunks to preserve context across boundaries, especially in technical or conversational text.
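The trade-off between continuity and redundancy can be seen in a small word-level sketch. The helper `sliding_window` and its `overlap_ratio` parameter are hypothetical names used for illustration; the sketch assumes whitespace-separated words as tokens.

```python
def sliding_window(tokens, window_size, overlap_ratio):
    """Overlapping windows over a token list (hypothetical helper)."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # A 50% overlap on a window of 4 tokens moves the window 2 tokens at a time.
    step = max(1, int(window_size * (1 - overlap_ratio)))
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the window already covers the end of the text
    return windows


words = "Reiner is a student from National University of Singapore (NUS).".split()
windows = sliding_window(words, window_size=4, overlap_ratio=0.5)
for window in windows:
    print(" ".join(window))

# Redundancy: how many tokens are stored relative to the original text.
stored = sum(len(window) for window in windows)
print(f"stored {stored} tokens for {len(words)} original tokens")
```

Here 10 original words become 16 stored words, which is the storage and processing overhead that a 50% overlap brings.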

In [5]:
print("#### Recursive Character Text Splitting ####")

from langchain_text_splitters import RecursiveCharacterTextSplitter
with open('example_docs/content.txt', 'r', encoding='utf-8') as file:
    text = file.read()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=35)  # default separators: ["\n\n", "\n", " ", ""]
print(text_splitter.create_documents([text])) 

#### Recursive Character Text Splitting ####
[Document(metadata={}, page_content='Reiner is a student from National University of Singapore (NUS). He is taking his Master of'), Document(metadata={}, page_content='(NUS). He is taking his Master of Computing with Artificial Intelligence Specialization there.'), Document(metadata={}, page_content='He is 25 years old currently and have been working on some personal projects to extend his'), Document(metadata={}, page_content='personal projects to extend his expertise and knowledge.'), Document(metadata={}, page_content='His personal hobby is playing basketball and playing games.')]


In [6]:
result = text_splitter.create_documents([text])
# print(len(result))
for i in result:
    print(f'{i.page_content} | size = {len(i.page_content)}')

Reiner is a student from National University of Singapore (NUS). He is taking his Master of | size = 91
(NUS). He is taking his Master of Computing with Artificial Intelligence Specialization there. | size = 94
He is 25 years old currently and have been working on some personal projects to extend his | size = 90
personal projects to extend his expertise and knowledge. | size = 56
His personal hobby is playing basketball and playing games. | size = 59


### Hierarchical and Contextual Chunking

These chunking methods are used when continuity alone is not enough and the document structure has to be preserved.

**Hierarchical Chunking** preserves the full structure of the document, from sections down to sentences. Instead of producing a flat list of chunks, it builds a tree that reflects the original hierarchy. Each chunk has a parent–child relationship with the levels above and below it. For example, a section contains multiple paragraphs (parent → children), and each paragraph may contain multiple sentences.

During retrieval, this structure enables flexible navigation. If a query matches a sentence-level chunk, the system can expand upward to provide additional context from its parent paragraph or even the entire section. Conversely, if a broad query matches a section-level chunk, the system can drill down into the most relevant child paragraph or sentence. This multi-level retrieval improves both precision and recall, since the model can adapt the scope of the returned content.
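A minimal sketch of such a tree, and of expanding a sentence-level match upward to its parent, might look like the following. The class `ChunkNode` and the helpers `build_tree`, `sentence_nodes` and `expand_up` are hypothetical names, and the sketch assumes paragraphs are separated by blank lines and sentences end with `.`, `!` or `?`.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ChunkNode:
    text: str
    level: str  # "section", "paragraph" or "sentence"
    parent: Optional["ChunkNode"] = field(default=None, repr=False)
    children: List["ChunkNode"] = field(default_factory=list, repr=False)


def build_tree(section_text):
    """Build a section -> paragraph -> sentence tree."""
    section = ChunkNode(section_text, "section")
    for para_text in re.split(r"\n\s*\n", section_text):
        para_text = para_text.strip()
        if not para_text:
            continue
        paragraph = ChunkNode(para_text, "paragraph", parent=section)
        section.children.append(paragraph)
        for sent_text in re.split(r"(?<=[.!?])\s+", para_text):
            paragraph.children.append(
                ChunkNode(sent_text, "sentence", parent=paragraph))
    return section


def sentence_nodes(node):
    """Collect all sentence-level leaves below a node."""
    if node.level == "sentence":
        return [node]
    return [leaf for child in node.children for leaf in sentence_nodes(child)]


def expand_up(node, levels=1):
    """Walk up the tree to return broader context for a match."""
    for _ in range(levels):
        if node.parent is None:
            break
        node = node.parent
    return node


doc = ("Reiner is a student from National University of Singapore (NUS). "
       "He is taking his Master of Computing with Artificial Intelligence "
       "Specialization there.\n\n"
       "His personal hobby is playing basketball and playing games.")
tree = build_tree(doc)

# Pretend the retriever matched this sentence-level chunk.
hit = next(s for s in sentence_nodes(tree) if "Master of Computing" in s.text)
print(expand_up(hit).text)  # the parent paragraph
```

In a real system only the small chunks would typically be embedded, while the parent links are stored alongside them so that the retriever can widen the context on demand.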

**Contextual chunking** goes a step further by enriching chunks with metadata such as headings, timestamps, or source references. This additional information provides important signals that help retrieval systems disambiguate results. For instance, two documents may contain nearly identical sentences, but their section titles or timestamps can determine which one is more relevant to a query. Metadata also makes it easier to trace answers back to their source, which is particularly valuable in regulated or compliance-driven domains.
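As a rough illustration, the sketch below attaches the enclosing heading path and a source label to every chunk of a small Markdown document. The function `contextual_chunks`, the metadata keys and the source name `california.md` are hypothetical choices, and plain dictionaries stand in for whatever document type a real pipeline would use.

```python
import re


def contextual_chunks(markdown_text, source):
    """Turn each non-heading line into a chunk that carries its heading path."""
    chunks = []
    heading_path = []  # stack of (level, title) for the enclosing headings
    for line in markdown_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            level = len(match.group(1))
            # Leave any heading at the same or a deeper level before entering this one.
            while heading_path and heading_path[-1][0] >= level:
                heading_path.pop()
            heading_path.append((level, match.group(2)))
            continue
        chunks.append({
            "text": line,
            "metadata": {
                "source": source,
                "headings": [title for _, title in heading_path],
            },
        })
    return chunks


markdown_doc = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

context_result = contextual_chunks(markdown_doc, source="california.md")
for chunk in context_result:
    print(chunk["metadata"]["headings"], "->", chunk["text"])
```

The heading path can then be used as a filter at query time or prepended to the chunk text before embedding, depending on how the retriever is set up.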

Advantage:
+ Both methods are accurate and flexible.

Drawbacks:
- Added complexity in both preprocessing and retrieval logic, since the system must manage relationships between chunks instead of treating them as independent units.

These approaches might be suitable for domains like legal contracts, financial reports, or technical specifications, where preserving structure and traceability is essential.

### Topic-based and Modality-Specific Chunking

These methods offer more flexible ways to group related content, because documents do not all share the same structure or hierarchy.

*Topic-based chunking* groups text into thematic units, using topic models such as Latent Dirichlet Allocation (LDA) or embedding-based clustering methods to identify semantic boundaries.

Instead of fixed sizes or structural markers, the goal is to keep all content related to a theme in one place. This approach works well for long-form content such as research reports or articles that shift between distinct subjects. Because each chunk stays focused on a single theme, retrieval results are more aligned with user intent and less likely to include unrelated material.
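The grouping step can be sketched as follows. To keep the example dependency-free, a hand-written keyword lookup stands in for a real topic model such as LDA, so the topic assignment here is an assumption made purely for illustration; the helper names, topics and paragraphs are all hypothetical.

```python
import re


def topic_of(paragraph, topic_keywords):
    """Pick the topic whose keywords overlap most with the paragraph.
    Keyword lookup is a stand-in for a real topic model."""
    words = set(re.findall(r"[a-z]+", paragraph.lower()))
    scores = {topic: len(words & keywords)
              for topic, keywords in topic_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"


def topic_chunks(paragraphs, topic_keywords):
    """Merge consecutive paragraphs that share a topic into one chunk."""
    chunks = []  # list of (topic, text) pairs
    for paragraph in paragraphs:
        topic = topic_of(paragraph, topic_keywords)
        if chunks and chunks[-1][0] == topic:
            chunks[-1] = (topic, chunks[-1][1] + "\n\n" + paragraph)
        else:
            chunks.append((topic, paragraph))
    return chunks


topic_keywords = {
    "nature": {"mountains", "snow", "spring", "forest"},
    "cities": {"cities", "urban", "ports", "bridges"},
}
paragraphs = [
    "The river rises from a spring in the mountains.",
    "Snow melt feeds the river each spring.",
    "Cities along the banks draw drinking water from it.",
    "Urban bridges and ports reshape the banks.",
]
topic_result = topic_chunks(paragraphs, topic_keywords)
for topic, chunk_text in topic_result:
    print(f"[{topic}] {chunk_text!r}")
```

Swapping `topic_of` for a trained topic model or a clustering step over embeddings would leave the merging logic unchanged.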

*Modality-specific chunking* adapts the strategy to each content type, so that the information is segmented in a way that respects the structure of the medium.
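One simple way to picture this is a dispatcher that picks a splitting rule per file type. The sketch below only covers three text-based formats, and the function names, the `SPLITTERS` table and the extension mapping are hypothetical; tables, images or audio would each need their own handling.

```python
import re
from pathlib import Path


def split_markdown(text):
    # Start a new chunk before every heading line.
    parts = re.split(r"(?m)^(?=#{1,6}\s)", text)
    return [part.strip() for part in parts if part.strip()]


def split_python(text):
    # Start a new chunk before every top-level def or class.
    parts = re.split(r"(?m)^(?=(?:def|class)\s)", text)
    return [part.strip() for part in parts if part.strip()]


def split_plain(text):
    # Fall back to paragraphs separated by blank lines.
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


SPLITTERS = {".md": split_markdown, ".py": split_python}


def chunk_by_modality(filename, text):
    """Choose a splitter from the file extension (hypothetical mapping)."""
    suffix = Path(filename).suffix.lower()
    return SPLITTERS.get(suffix, split_plain)(text)


md_chunks = chunk_by_modality("notes.md", "# A\n\ntext a\n\n## B\n\ntext b\n")
py_chunks = chunk_by_modality(
    "script.py",
    "import os\n\nclass A:\n    def m(self):\n        pass\n\ndef f():\n    return 1\n",
)
txt_chunks = chunk_by_modality("readme.txt", "para one\n\npara two")
print(md_chunks, py_chunks, txt_chunks, sep="\n")
```

The same dispatch pattern works with library splitters in place of the regex helpers.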