In [1]:
docs = [
    # DOC 1: Long narrative text with a single line break and no blank lines (paragraph-level "\n\n" splitting finds nothing to split on)
    """The French Revolution began in 1789 as a result of deep social and economic inequality.
The French society was divided into three estates, where the clergy and nobility enjoyed privileges, while the common people bore heavy taxation and hardship. Rising bread prices, state debt, and the influence of Enlightenment ideas further fueled public anger. The calling of the Estates-General eventually led to the formation of the National Assembly.""",

    # DOC 2: Paragraphs separated by blank lines, but each paragraph is longer than chunk_size
    """The Industrial Revolution marked a major turning point in human history, influencing almost every aspect of daily life through mechanization and technological innovation.

Production shifted from manual labor to machines, factories expanded rapidly, and urbanization increased as people moved from rural areas to industrial centers in search of work.

While industrialization generated economic growth and new opportunities, it also produced harsh working conditions, child labor, pollution, and severe social inequality.""",

    # DOC 3: Clear paragraph breaks (paragraph splitting should succeed)
    """World War I began in 1914 following the assassination of Archduke Franz Ferdinand.
It involved major world powers divided into the Allied and Central Powers.

The war introduced new forms of warfare including trenches, machine guns, and chemical weapons.
By 1918, millions had died, and the political map of Europe was permanently altered."""
]


In [2]:
for item in docs:
    print(len(item))

442
521
339


In [3]:
chunk_size = 150  # maximum chunk length, in characters
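
As a quick sanity check, the character counts printed above put a floor on how many chunks each document can produce at this `chunk_size`. The sketch below is only that arithmetic; `min_chunks` is a hypothetical helper name, and the real splitter may well return more chunks because it prefers to cut at separators rather than at exact character offsets.

```python
import math

def min_chunks(text_length, chunk_size):
    """Lower bound on the chunk count (hypothetical helper, not a LangChain function)."""
    return math.ceil(text_length / chunk_size)

doc_lengths = [442, 521, 339]  # lengths printed by the previous cell
chunk_size = 150

lower_bounds = [min_chunks(n, chunk_size) for n in doc_lengths]
print(lower_bounds)
```
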

`uv add langchain-text-splitters`

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

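class DebugRecursiveCharacterTextSplitter(RecursiveCharacterTextSplitter):
    """Splitter that logs every recursive call so the separator fallback is visible."""

    def _split_text(self, text, separators):
        # _split_text is the private method the parent class calls recursively;
        # each call receives the separators that are still left to try.
        print("\n--- RECURSION LEVEL ---")
        print(f"Trying separators: {separators}")
        print(f"Text length: {len(text)}")
        return super()._split_text(text, separators)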
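
To make the fallback idea concrete without relying on the library, here is a minimal pure-Python sketch of recursive splitting. It is an illustration under simplifying assumptions, not LangChain's actual implementation: `recursive_split` is a hypothetical name, separators are dropped rather than kept, and there is no overlap handling.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified separator fallback (illustrative only, not LangChain's code)."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    if sep not in text:
        # This separator cannot help, so fall through to the next one.
        return recursive_split(text, remaining, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
sample_chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=9)
print(sample_chunks)
```

In this toy input the first paragraph fits in one chunk, while the second is too long and gets retried with the space separator, which is the same pattern the debug output is meant to expose.
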




In [5]:
splitter = DebugRecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=0
)

for i, doc in enumerate(docs):
    print(f"\n================ DOC {i+1} ================\n")
    chunks = splitter.split_text(doc)

    for j, chunk in enumerate(chunks):
        print(f"\nChunk {j+1} ({len(chunk)} chars):\n{chunk}")





================ DOC 1 ================


--- RECURSION LEVEL ---
Trying separators: ['\n\n', '\n', ' ', '']
Text length: 442

--- RECURSION LEVEL ---
Trying separators: [' ', '']
Text length: 355

Chunk 1 (87 chars):
The French Revolution began in 1789 as a result of deep social and economic inequality.

Chunk 2 (148 chars):
The French society was divided into three estates, where the clergy and nobility enjoyed privileges, while the common people bore heavy taxation and

Chunk 3 (147 chars):
hardship. Rising bread prices, state debt, and the influence of Enlightenment ideas further fueled public anger. The calling of the Estates-General

Chunk 4 (57 chars):
eventually led to the formation of the National Assembly.
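
One detail in the DOC 1 trace is worth spelling out: the second recursion level reports 355 characters, yet the text after the line break is only 354 characters long (442 total, minus the 87-character first line, minus the newline). The numbers are consistent if the separator is kept and attached to the start of the following piece, which appears to be the splitter's default behaviour. The sketch below reproduces that arithmetic with placeholder text; `split_keeping_separator` is a hypothetical helper, not a library function.

```python
import re

def split_keeping_separator(text, separator):
    """Split on `separator`, attaching it to the start of the next piece (illustrative)."""
    # A capturing group makes re.split return the separators too:
    # [text, sep, text, sep, text, ...]
    parts = re.split(f"({re.escape(separator)})", text)
    pieces = [parts[0]]
    for i in range(1, len(parts) - 1, 2):
        pieces.append(parts[i] + parts[i + 1])
    return [p for p in pieces if p]

# Placeholder text with the same shape as DOC 1: 87 chars, a newline, 354 chars.
first_line = "x" * 87
rest = "y" * 354
pieces = split_keeping_separator(first_line + "\n" + rest, "\n")
piece_lengths = [len(p) for p in pieces]
print(piece_lengths)
```

The three chunks cut from that second piece are 148, 147, and 57 characters long, and the two spaces at the cut points are dropped: 148 + 147 + 57 + 2 = 354. The leading newline does not show up in Chunk 2, presumably because whitespace is stripped from the final chunks.
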



================ DOC 2 ================


--- RECURSION LEVEL ---
Trying separators: ['\n\n', '\n', ' ', '']
Text length: 521

--- RECURSION LEVEL ---
Trying separators: ['\n', ' ', '']
Text length: 170

--- RECURSION LEVEL ---
Trying separators: ['\n', ' ', '']
Text length: 180

--- RECURSION LEVEL ---
Trying separators: [' ', '']
Text length: 179

