# 1. CharacterTextSplitter

Best for: simple, deterministic splitting on a single separator, with chunk size measured in characters (fast and predictable).
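As a rough illustration of what character-based splitting does, the sketch below splits on one separator and then greedily packs the pieces into chunks of at most `chunk_size` characters. It is a simplification under stated assumptions, not LangChain's implementation: the function name `split_and_merge` is hypothetical and overlap handling is left out.

```python
def split_and_merge(text, separator="\n\n", chunk_size=400):
    """Split on one separator, then greedily pack pieces into chunks."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # The piece fits, or it is the first piece of a new chunk.
            # An oversized piece is kept whole: a single separator
            # cannot break it down any further.
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccccccccc"
result = split_and_merge(sample, separator="\n\n", chunk_size=10)
for chunk in result:
    print(repr(chunk))
```

Note that a piece longer than `chunk_size` stays intact here, which is why a single-separator splitter can return chunks larger than the requested size.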

In [2]:

text = """
It is a truth universally acknowledged, that a single man in possession of a good fortune,
must be in want of a wife. However little known the feelings or views of such a man may be
on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families,
that he is considered the rightful property of some one or other of their daughters.

“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?”
Mr. Bennet replied that he had not.
“But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.”

Mr. Bennet made no answer.
“Do you not want to know who has taken it?” cried his wife impatiently.
“_You_ want to tell me, and I have no objection to hearing it.”
This was invitation enough.

“Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man
of large fortune from the north of England; that he came down on Monday in a chaise and four
to see the place, and was so much delighted with it, that he agreed with Mr. Morris immediately;
that he is to take possession before Michaelmas, and some of his servants are to be in the house by the end of next week.”

“What is his name?”
“Bingley.”
“Is he married or single?”
“Oh! single, my dear, to be sure! A single man of large fortune; four or five thousand a year.
What a fine thing for our girls!”
"""

Common separators, from coarsest to finest (CharacterTextSplitter uses exactly one of them; RecursiveCharacterTextSplitter tries them in this order):

"\n\n" → Paragraph breaks

"\n" → Line breaks (poetry, code, short sentences)

" " → Space (word-level splitting if larger breaks aren’t available)

"" → Character-level fallback (last resort)

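The separator list above can be read as a fallback chain: use the coarsest separator that actually occurs in the text, and only drop to a finer one when it does not. A minimal sketch of that selection step follows; the helper name `pick_separator` is hypothetical and not part of LangChain.

```python
def pick_separator(text, separators=("\n\n", "\n", " ", "")):
    """Return the first (coarsest) separator that occurs in the text."""
    for sep in separators:
        # The empty string means "split between every character",
        # so it always applies and acts as the last resort.
        if sep == "" or sep in text:
            return sep
    return ""


print(repr(pick_separator("para one\n\npara two")))
print(repr(pick_separator("line one\nline two")))
print(repr(pick_separator("just words")))
print(repr(pick_separator("nospaces")))
```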
In [10]:
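# Newer LangChain releases also expose this class from the langchain_text_splitters package.
from langchain.text_splitter import CharacterTextSplitter


splitter = CharacterTextSplitter(
    separator="\n\n",   # split on paragraph breaks only
    chunk_size=400,     # target max characters per chunk
    chunk_overlap=30,   # max characters of overlap between adjacent chunks
)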

chunks = splitter.split_text(text)
print(len(chunks))
print("Chunk 1:",chunks[0],"\n\n")
print("Chunk 2:",chunks[1],"\n\n")
print("Chunk 3:",chunks[2],"\n\n")
print("Chunk 4:",chunks[3],"\n\n")



5
Chunk 1: It is a truth universally acknowledged, that a single man in possession of a good fortune,
must be in want of a wife. However little known the feelings or views of such a man may be
on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families,
that he is considered the rightful property of some one or other of their daughters. 


Chunk 2: “My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?”
Mr. Bennet replied that he had not.
“But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” 


Chunk 3: Mr. Bennet made no answer.
“Do you not want to know who has taken it?” cried his wife impatiently.
“_You_ want to tell me, and I have no objection to hearing it.”
This was invitation enough. 


Chunk 4: “Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man
of large fortune from the north of England; that he came down on

# 2. RecursiveCharacterTextSplitter

Best for: splitting long documents while preserving logical boundaries. It tries the separators in order and recurses with the next, finer separator only on pieces that are still larger than the chunk size.
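The recursion can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's code: the function name `recursive_split` is hypothetical, and the sketch skips the step where small pieces are merged back up to the chunk size and overlap is applied.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Break text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                pieces.append(part)
        else:
            # Still too large: retry this part with the finer separators.
            pieces.extend(recursive_split(part, chunk_size, finer or ("",)))
    return pieces


demo = recursive_split("aaa bbb\n\ncccccccc", chunk_size=4)
print(demo)
```

Only the oversized parts are split further, so paragraphs that already fit are never cut at a word or character boundary.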

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,        # target max characters per chunk
    chunk_overlap=30,      # overlap between adjacent chunks
    separators=["\n\n", "\n", " ", ""],  # try these separators in order
)

chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
print("Chunk 1:",chunks[0],"\n\n")
print("Chunk 2:",chunks[1],"\n\n")
print("Chunk 3:",chunks[2],"\n\n")
print("Chunk 4:",chunks[3],"\n\n")

Created 5 chunks
Chunk 1: It is a truth universally acknowledged, that a single man in possession of a good fortune,
must be in want of a wife. However little known the feelings or views of such a man may be
on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families,
that he is considered the rightful property of some one or other of their daughters. 


Chunk 2: “My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?”
Mr. Bennet replied that he had not.
“But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” 


Chunk 3: Mr. Bennet made no answer.
“Do you not want to know who has taken it?” cried his wife impatiently.
“_You_ want to tell me, and I have no objection to hearing it.”
This was invitation enough. 


Chunk 4: “Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man
of large fortune from the north of England; that 

# 3. TokenTextSplitter

Best for: splitting by model tokens (preferred when chunks must stay within an LLM's token limit). Chunk size and overlap are counted in tokens, not characters. Requires a token encoder (e.g., tiktoken).
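Token-based splitting amounts to sliding a fixed-size window over the token sequence, stepping forward by `chunk_size - chunk_overlap` each time. The sketch below shows that windowing on its own. As an assumption for illustration, whitespace-separated words stand in for real tokens (a real encoder such as tiktoken produces different units), and the function name `window_tokens` is hypothetical.

```python
def window_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, sharing chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the sequence
    return chunks


# Assumption: words stand in for tokens here.
words = "It is a truth universally acknowledged".split()
windows = window_tokens(words, chunk_size=4, chunk_overlap=1)
for w in windows:
    print(" ".join(w))
```

Because the window ignores sentence and paragraph boundaries, a token-based chunk can end in the middle of a sentence.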

In [None]:
from langchain.text_splitter import TokenTextSplitter


splitter = TokenTextSplitter(
    encoding_name="gpt2",   # or "cl100k_base" for OpenAI's newer encodings
    chunk_size=300,         # tokens per chunk
    chunk_overlap=30,
)

chunks = splitter.split_text(text)
print(f"{len(chunks)} token-aware chunks")
print("Chunk 1:",chunks[0],"\n\n")
print("Chunk 2:",chunks[1],"\n\n")

2 token-aware chunks
Chunk 1: 
It is a truth universally acknowledged, that a single man in possession of a good fortune,
must be in want of a wife. However little known the feelings or views of such a man may be
on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families,
that he is considered the rightful property of some one or other of their daughters.

“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?”
Mr. Bennet replied that he had not.
“But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.”

Mr. Bennet made no answer.
“Do you not want to know who has taken it?” cried his wife impatiently.
“_You_ want to tell me, and I have no objection to hearing it.”
This was invitation enough.

“Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man
of large fortune from the north of England; that he came down on Monday in a 

# 4. MarkdownTextSplitter

Best for: Markdown documents. It prefers to split at Markdown structure (headings, code fences, horizontal rules), so sections and blocks stay together where the chunk size allows.
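The core idea, cutting a document in front of its headings, can be sketched with a regular expression. This is an illustration only and not how LangChain implements it: the function name `split_at_headings` is hypothetical, and the sketch ignores chunk size, overlap, and every Markdown construct other than `#` headings.

```python
import re


def split_at_headings(md):
    """Cut a Markdown string in front of every ATX heading line."""
    # The lookahead keeps the heading with the section that follows it.
    # Caveat: a line starting with "# " inside a code fence is also
    # treated as a heading by this simple pattern.
    parts = re.split(r"\n(?=#{1,6} )", md)
    return [p.strip() for p in parts if p.strip()]


doc = "# Title\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n## Usage\n\nRun it."
sections = split_at_headings(doc)
for s in sections:
    print("---")
    print(s)
```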

In [18]:
from langchain.text_splitter import MarkdownTextSplitter


# Example: Realistic Markdown text
md_text = """
# Project Documentation

Welcome to the **Project X** documentation.  
This document provides details about setup, usage, and deployment.

---

## 1. Introduction

Project X is an AI-powered system for automating document processing.  
It supports text extraction, summarization, and search.

## 2. Installation

Follow these steps:

1. Clone the repo:
   ```bash
   git clone https://github.com/example/project-x.git
   cd project-x
"""

splitter = MarkdownTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
)

chunks = splitter.split_text(md_text)
for i, c in enumerate(chunks[:3]):
    print(f"--- CHUNK {i} ---")
    print(c)


--- CHUNK 0 ---
# Project Documentation

Welcome to the **Project X** documentation.  
This document provides details about setup, usage, and deployment.

---

## 1. Introduction
--- CHUNK 1 ---
## 1. Introduction

Project X is an AI-powered system for automating document processing.  
It supports text extraction, summarization, and search.

## 2. Installation

Follow these steps:
--- CHUNK 2 ---
1. Clone the repo:
   ```bash
   git clone https://github.com/example/project-x.git
   cd project-x


# 5. PythonCodeTextSplitter

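Best for: Python source code. It prefers to split at class and function definitions, and falls back to finer separators (blank lines, then single lines) when a block exceeds the chunk size, so a long function can still be cut partway through.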
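A minimal sketch of the first step, cutting only in front of top-level `def` and `class` statements, is shown below. It assumes the splitter works from textual separators such as `"\nclass "` and `"\ndef "` rather than from a parsed syntax tree; the function name `split_top_level` is hypothetical, and chunk size and overlap are ignored.

```python
import re


def split_top_level(code):
    """Cut Python source in front of each top-level def or class."""
    # Indented "def" lines (methods) do not match, because the newline
    # must be followed directly by the keyword.
    parts = re.split(r"\n(?=(?:def|class) )", code)
    return [p.strip("\n") for p in parts if p.strip()]


source = """
def add(a, b):
    return a + b

class Calculator:
    def multiply(self, x, y):
        return x * y
"""
blocks = split_top_level(source)
for b in blocks:
    print("--- BLOCK ---")
    print(b)
```

Methods stay attached to their class in this sketch; a size-limited splitter has to break them apart once a class is larger than the chunk size.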

In [20]:
from langchain.text_splitter import PythonCodeTextSplitter

code = """
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

def divide(a, b):
    if b == 0:
        raise ValueError("Division by zero")
    return a / b

def power(base, exp):
    return base ** exp

class Calculator:
    def multiply(self, x, y):
        return x * y

    def factorial(self, n):
        if n == 0 or n == 1:
            return 1
        return n * self.factorial(n-1)

    def fibonacci(self, n):
        if n <= 0:
            return []
        elif n == 1:
            return [0]
        elif n == 2:
            return [0, 1]
        seq = [0, 1]
        for i in range(2, n):
            seq.append(seq[i-1] + seq[i-2])
        return seq

class Geometry:
    def area_circle(self, r):
        from math import pi
        return pi * r * r

    def area_square(self, s):
        return s * s

    def area_rectangle(self, l, w):
        return l * w
"""

# Split code
splitter = PythonCodeTextSplitter(chunk_size=120, chunk_overlap=30)
chunks = splitter.split_text(code)

print(f"Created {len(chunks)} chunks\n")
for i, c in enumerate(chunks, start=1):
    print(f"--- CHUNK {i} ---\n{c}\n")


Created 10 chunks

--- CHUNK 1 ---
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

--- CHUNK 2 ---
def divide(a, b):
    if b == 0:
        raise ValueError("Division by zero")
    return a / b

--- CHUNK 3 ---
def power(base, exp):
    return base ** exp

--- CHUNK 4 ---
class Calculator:
    def multiply(self, x, y):
        return x * y

--- CHUNK 5 ---
def factorial(self, n):
        if n == 0 or n == 1:
            return 1
        return n * self.factorial(n-1)

--- CHUNK 6 ---
def fibonacci(self, n):
        if n <= 0:
            return []
        elif n == 1:
            return [0]

--- CHUNK 7 ---
return [0]
        elif n == 2:
            return [0, 1]
        seq = [0, 1]

--- CHUNK 8 ---
seq = [0, 1]
        for i in range(2, n):
            seq.append(seq[i-1] + seq[i-2])
        return seq

--- CHUNK 9 ---
class Geometry:
    def area_circle(self, r):
        from math import pi
        return pi * r * r

--- CHUNK 10 ---
def area_square(self, s):