<h2>Key chunking methods for Gen AI and RAG workflows</h2>

1. Fixed-Size Chunking
2. Sliding Window Chunking
3. Paragraph-Based Chunking
4. Token-Based Chunking
5. Semantic Chunking (via Embeddings or LLMs)
6. Recursive Character/Text Splitting (used in LangChain)
7. Metadata-Aware Chunking
8. Title + Content Chunking
9. Hybrid Chunking (combinations of the above)

Code: Fixed-Size Chunking

In [64]:
from langchain.text_splitter import CharacterTextSplitter

# Load your text
with open("../data/sample.txt", "r", encoding="utf-8") as f:
    text = f.read()
    
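# Split on newlines, then merge the pieces into chunks of at most 250 characters
fixed_size_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=250,
    chunk_overlap=0,  # the default is 200; set to 0 so chunks do not repeat text
)

chunks = fixed_size_splitter.split_text(text)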

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
    print("#"*40)


--- Chunk 1 ---
One day, an old man was walking along a beach that was littered with thousands of starfish that had been washed ashore by the tide. As he walked, he saw a young boy in the distance, picking something up and gently throwing it into the ocean.

########################################
--- Chunk 2 ---
As the man approached, he called out, “Good morning! May I ask what you are doing?”

########################################
--- Chunk 3 ---
The boy paused, looked up, and replied, “I’m throwing starfish back into the ocean. The tide has washed them up and they can’t return to the sea by themselves. If I don’t throw them back, they’ll die.”

########################################
--- Chunk 4 ---
The old man replied, “But there must be tens of thousands of starfish on this beach. I’m afraid you won’t really be able to make much of a difference.”

########################################
--- Chunk 5 ---
The boy bent down, picked up another starfish, and threw it as far as he
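
Note that `CharacterTextSplitter` splits on the separator first and then merges the pieces up to `chunk_size`, which is why the chunks above follow paragraph boundaries and differ in length. For comparison, a strictly fixed-size split can be sketched in plain Python. The function name `fixed_size_chunks` and the sample string are illustrative assumptions, not part of LangChain.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive slices of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last slice may be shorter
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Illustrative sample string (assumption, not the contents of sample.txt)
sample = "One day, an old man was walking along a beach."
for i, chunk in enumerate(fixed_size_chunks(sample, 20)):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
```

This version ignores word and sentence boundaries entirely, so it can cut in the middle of a word. That is the trade-off a separator-aware splitter avoids.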

Code: Sliding Window Chunking

In [77]:
from langchain.text_splitter import CharacterTextSplitter

# Load your text
with open("../data/no_newline.txt", "r", encoding="utf-8") as f:
    text = f.read()
    
# Split on spaces and carry up to 20 characters of each chunk into the next one
sliding_window_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=250,
    chunk_overlap=20,
)

chunks = sliding_window_splitter.split_text(text)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
    print("#"*40)


--- Chunk 1 ---
Lena was a curious girl who loved exploring the woods behind her house every afternoon she would pack a small bag with snacks a notebook and her favorite compass one day she discovered a trail she had never seen before it led her to a hidden pond

########################################
--- Chunk 2 ---
her to a hidden pond with crystal clear water and colorful fish swimming near the surface excited she began sketching what she saw in her notebook suddenly a small turtle climbed out of the water and looked at her as if it wanted to say something she

########################################
--- Chunk 3 ---
to say something she laughed and said hello little one this became her secret spot a magical place she would return to again and again.

########################################
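
The overlap visible above ("her to a hidden pond", "to say something she") is the sliding window at work. As a minimal sketch of the same idea at the character level, without LangChain: the window advances by `chunk_size - overlap` characters each step. The function name `sliding_window_chunks` is an assumption made for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Return windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


for i, chunk in enumerate(sliding_window_chunks("abcdefghij", 4, 2)):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
```

Unlike the LangChain splitter, this sketch does not respect word boundaries; it only shows how the window and the overlap relate.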


Code: Paragraph-Based Chunking

In [83]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load your text
with open("../data/sample.txt", "r", encoding="utf-8") as f:
    text = f.read()
    
paragraph_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    separators=["\n"]  # Split at newlines only; paragraphs longer than chunk_size are kept whole
)

# Split text into chunks
chunks = paragraph_splitter.split_text(text)

# Print sample chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")

--- Chunk 1 ---
One day, an old man was walking along a beach that was littered with thousands of starfish that had been washed ashore by the tide. As he walked, he saw a young boy in the distance, picking something up and gently throwing it into the ocean.

--- Chunk 2 ---
As the man approached, he called out, “Good morning! May I ask what you are doing?”

--- Chunk 3 ---

The boy paused, looked up, and replied, “I’m throwing starfish back into the ocean. The tide has washed them up and they can’t return to the sea by themselves. If I don’t throw them back, they’ll die.”
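
When paragraphs are separated by blank lines, the same result can be approximated with the standard library alone. The sketch below assumes that a paragraph boundary is one or more blank (or whitespace-only) lines; `paragraph_chunks` is a hypothetical helper name.

```python
import re


def paragraph_chunks(text: str) -> list[str]:
    """Split text into paragraphs at blank lines and drop empty pieces."""
    # A newline, optional whitespace, then another newline marks a paragraph break
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if p.strip()]


# Illustrative sample text (assumption)
sample = "First paragraph.\n\nSecond paragraph.\n \nThird paragraph.\n"
for i, chunk in enumerate(paragraph_chunks(sample)):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
```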



Code: Token-Based Chunking

In [88]:
import tiktoken

# Load text from file
with open("../data/sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Initialize tiktoken encoder (for GPT-style tokenization)
enc = tiktoken.get_encoding("gpt2")

# Tokenize the text
tokens = enc.encode(text)

# Set the chunk size in terms of tokens (e.g., 100 tokens per chunk)
chunk_size = 100

# Create chunks based on token count
chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

# Convert token chunks back to text (optional)
chunk_texts = [enc.decode(chunk) for chunk in chunks]

# Print first 3 chunks
for i, chunk in enumerate(chunk_texts[:3]):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")


--- Chunk 1 ---
One day, an old man was walking along a beach that was littered with thousands of starfish that had been washed ashore by the tide. As he walked, he saw a young boy in the distance, picking something up and gently throwing it into the ocean.

As the man approached, he called out, “Good morning! May I ask what you are doing?”

The boy paused, looked up, and replied, “I’m throwing starfish back

--- Chunk 2 ---
 into the ocean. The tide has washed them up and they can’t return to the sea by themselves. If I don’t throw them back, they’ll die.”

The old man replied, “But there must be tens of thousands of starfish on this beach. I’m afraid you won’t really be able to make much of a difference.”

The boy bent down, picked up another starfish, and

--- Chunk 3 ---
 threw it as far as he could into the ocean. Then he turned, smiled, and said, “I made a difference to that one.”

The man looked at the boy, thought for a moment, and then began throwing starfish back into the sea
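
The token chunks above cut sentences in half ("I’m throwing starfish back" / " into the ocean"). One possible hybrid, sketched below under stated assumptions, packs whole sentences into a token budget instead. The helper `pack_sentences` is hypothetical, the sentence regex is deliberately naive, and the default token counter is a whitespace word count used only as a stand-in; with tiktoken you could pass `count_tokens=lambda s: len(enc.encode(s))`.

```python
import re
from typing import Callable


def pack_sentences(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> list[str]:
    """Greedily group whole sentences so each chunk stays within max_tokens."""
    # Naive sentence split: whitespace that directly follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would exceed the budget
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Illustrative sample text (assumption)
sample = "A b c. D e. F g h i. J."
for i, chunk in enumerate(pack_sentences(sample, max_tokens=5)):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
```

A sentence that is longer than the budget on its own is kept whole rather than cut, and summing per-sentence counts only approximates the token count of the joined chunk, so treat `max_tokens` as a soft limit.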

Code: Markdown Chunking

In [90]:
import re

with open("../data/markdown.md", "r", encoding="utf-8") as f:
    md_text = f.read()

# Split at Markdown headings (lines starting with one or more #).
# The "\n#+ " delimiter is consumed, so every chunk after the first loses its "#" markers.
chunks = re.split(r'\n#+ ', md_text)

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---\n{chunk.strip()}\n")


--- Chunk 1 ---
# Introduction
This is an overview of semantic chunking in text processing.

--- Chunk 2 ---
Why Chunking Matters
Chunking helps large language models process information more efficiently.

--- Chunk 3 ---
Types of Chunking
- Fixed-size chunking
- Sentence-based chunking
- Token-based chunking
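
In the output above, only the first chunk keeps its `#` marker because `re.split` consumes the delimiter. A title + content variant, sketched below, splits with a zero-width lookahead so each heading stays attached to its body and is also stored as metadata. The function name `split_markdown_sections` and the dictionary keys are assumptions made for illustration; `#` lines inside fenced code would be misread as headings.

```python
import re


def split_markdown_sections(md_text: str) -> list[dict]:
    """Split Markdown into sections, keeping each heading with its content."""
    # Zero-width split just before every heading line, so nothing is consumed
    parts = re.split(r"(?m)^(?=#{1,6} )", md_text)
    sections = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        match = re.match(r"(#{1,6}) (.*)", part)
        if match:
            level = len(match.group(1))           # number of '#' characters
            title = match.group(2).strip()        # heading text without markers
            content = part[match.end():].strip()  # everything after the heading line
        else:
            # Text that appears before the first heading
            level, title, content = 0, "", part
        sections.append({"level": level, "title": title, "content": content})
    return sections


# Illustrative sample document (assumption)
sample_md = "# Introduction\nAn overview.\n\n## Why Chunking Matters\nIt helps.\n"
for section in split_markdown_sections(sample_md):
    print(section)
```

Storing the title and level next to the content makes it possible to prepend the heading to each chunk before embedding, or to filter retrieved chunks by section.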





Code: Recursive Character Splitting using LangChain's RecursiveCharacterTextSplitter

In [26]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load your text
with open("../data/sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Initialize RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n", " ", ""],  # try line breaks first, then spaces, then single characters
    chunk_size=250,      # maximum chunk size in characters
    chunk_overlap=50     # up to 50 characters shared between consecutive chunks
)

# Split text
chunks = text_splitter.split_text(text)

# Print all chunks
for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")


--- Chunk 1 ---
One day, an old man was walking along a beach that was littered with thousands of starfish that had been washed ashore by the tide. As he walked, he saw a young boy in the distance, picking something up and gently throwing it into the ocean.

--- Chunk 2 ---
As the man approached, he called out, “Good morning! May I ask what you are doing?” The boy paused, looked up, and replied, “I’m throwing starfish back into the ocean. The tide has washed them up and they can’t return to the sea by themselves. If I

--- Chunk 3 ---
they can’t return to the sea by themselves. If I don’t throw them back, they’ll die.”

--- Chunk 4 ---
The old man replied, “But there must be tens of thousands of starfish on this beach. I’m afraid you won’t really be able to make much of a difference.”

--- Chunk 5 ---
The boy bent down, picked up another starfish, and threw it as far as he could into the ocean. Then he turned, smiled, and said, “I made a difference to that one.”

--- Chunk 6 ---
The ma
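
To make the fallback order of separators concrete, here is a simplified sketch of recursive splitting in plain Python. It is not LangChain's implementation: it has no overlap and drops the separators at chunk boundaries. The name `recursive_split` and its default separators are assumptions made for illustration.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the first separator; recurse with the next one for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators
            chunks.extend(recursive_split(piece, chunk_size, rest or ("",)))
    if current:
        chunks.append(current)
    return chunks


for i, chunk in enumerate(recursive_split("aaa bbb ccc\ndddddddd", 7, ("\n", " ", ""))):
    print(f"--- Chunk {i+1} ---\n{chunk}\n")
```

Coarse separators are tried first so that chunks follow natural boundaries where possible, and character-level cuts happen only when nothing else fits.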