### Text splitting techniques

LangChain's `TextSplitter` classes turn a list of documents (or long strings) into smaller chunks, optionally with some overlap between neighboring chunks.

Common classes in `langchain.text_splitter`:

| Splitter Class                   | Use Case                                             |
| -------------------------------- | ---------------------------------------------------- |
| `RecursiveCharacterTextSplitter` | Default choice: paragraphs, then lines, then words   |
| `CharacterTextSplitter`          | Splits based on a character (like "\n" or space)     |
| `TokenTextSplitter`              | Splits by tokens (based on tokenizer like tiktoken)  |
| `MarkdownTextSplitter`           | Specialized for Markdown formatting                  |
| `SpacyTextSplitter`              | Uses spaCy for sentence-based splitting              |


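### RecursiveCharacterTextSplitter
#### Under the Hood: Recursive Splitting
`RecursiveCharacterTextSplitter` tries its separators in order, from the largest natural boundary to the smallest. The defaults are:

1. Paragraphs (`"\n\n"`)
2. Lines (`"\n"`)
3. Words (`" "`)
4. Individual characters (`""`)

It splits on the first separator and, for any piece that is still longer than `chunk_size`, recursively backs off to the next, smaller boundary. Sentence boundaries are not among the defaults; pass a custom `separators` list to split on them.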
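
The back-off described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical name, length is measured in characters, the separator order is assumed to match the library defaults, and `chunk_overlap` is ignored.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring the coarsest separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: back off to the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
print(recursive_split(sample, chunk_size=10))
# ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits as a whole, while the second is too long and gets split again on spaces.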

#### Why Overlap?
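Each chunk is embedded or sent to the LLM on its own, so text cut at a chunk boundary loses its surrounding context. With `chunk_overlap` set, up to that many characters from the end of one chunk are repeated at the start of the next. In the output below, Chunk 3 begins with "like document loaders, text", which are the last words of Chunk 2.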
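
To see the overlap concretely, the snippet below measures the text shared by two consecutive chunks. The strings are the tail of Chunk 2 and the whole of Chunk 3 from the `RecursiveCharacterTextSplitter` output in this section; `shared_overlap` is a hypothetical helper, not part of LangChain.

```python
def shared_overlap(prev_chunk, next_chunk):
    """Return the longest suffix of prev_chunk that is also a prefix of next_chunk."""
    max_len = min(len(prev_chunk), len(next_chunk))
    for size in range(max_len, 0, -1):
        if prev_chunk.endswith(next_chunk[:size]):
            return next_chunk[:size]
    return ""


prev_chunk = "LangChain solves this by offering tools like document loaders, text"
next_chunk = "like document loaders, text splitters, and memory modules."

overlap = shared_overlap(prev_chunk, next_chunk)
print(repr(overlap), len(overlap))
# 'like document loaders, text' 27
```

The shared text is 27 characters, which stays within the `chunk_overlap=30` used in the example.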

In [44]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("example.txt")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
chunks = splitter.split_documents(documents)

for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---\n")
    print(chunk.page_content)


--- Chunk 1 ---

LangChain is a framework designed to help developers build applications powered by language models. It provides tools and abstractions for working with large language models like GPT-4.

--- Chunk 2 ---

One of the core challenges in using LLMs is managing the context window — models can only take in a limited number of tokens. LangChain solves this by offering tools like document loaders, text

--- Chunk 3 ---

like document loaders, text splitters, and memory modules.

--- Chunk 4 ---

Text splitters are particularly useful when working with long documents. They break down text into chunks that fit within the LLM’s context window, optionally with overlapping content for

--- Chunk 5 ---

with overlapping content for continuity.

--- Chunk 6 ---

LangChain also supports retrieval-based question answering (RAG), where external documents are converted into embeddings, stored in a vector database, and queried based on similarity to a user's

--- Chunk 7 ---

on simi

### CharacterTextSplitter
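`CharacterTextSplitter` splits on a single separator string (such as `"\n"` or a space) and merges the pieces into chunks of up to `chunk_size` characters. A piece that is longer than `chunk_size` is kept whole rather than split further.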

In [45]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
loader = TextLoader("example.txt")
documents = loader.load()

splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=300,
    chunk_overlap=50
)

chunks = splitter.split_documents(documents)

for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---\n")
    print(chunk.page_content)


--- Chunk 1 ---

LangChain is a framework designed to help developers build applications powered by language models. It provides tools and abstractions for working with large language models like GPT-4.

--- Chunk 2 ---

One of the core challenges in using LLMs is managing the context window — models can only take in a limited number of tokens. LangChain solves this by offering tools like document loaders, text splitters, and memory modules.

--- Chunk 3 ---

Text splitters are particularly useful when working with long documents. They break down text into chunks that fit within the LLM’s context window, optionally with overlapping content for continuity.

--- Chunk 4 ---

LangChain also supports retrieval-based question answering (RAG), where external documents are converted into embeddings, stored in a vector database, and queried based on similarity to a user's question.

--- Chunk 5 ---

With LangChain, developers can build chatbots, summarization tools, document Q&A apps, and
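
The split-then-merge behavior can be approximated in a few lines. This is a minimal sketch under stated assumptions: `split_on_separator` is a hypothetical name, sizes are counted in characters, and overlap is left out to keep the merge step visible.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = current + separator + piece if current else piece
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # current chunk is full, start a new one
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "one\ntwo\nthree\n" + "x" * 20
print(split_on_separator(text, "\n", chunk_size=10))
# ['one\ntwo', 'three', 'xxxxxxxxxxxxxxxxxxxx']
```

The last piece has no separator inside it, so it stays at 20 characters even though the limit is 10. This is the main difference from the recursive strategy, which would fall back to a finer separator.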

### TokenTextSplitter
- Useful when chunk sizes must match LLM token limits: `chunk_size` and `chunk_overlap` are counted in tokens, not characters
- Helps avoid exceeding the model's input size
- Uses tiktoken, the tokenizer used by OpenAI models

In [46]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import TokenTextSplitter

loader = TextLoader("example.txt")
documents = loader.load()

splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_documents(documents)

for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---\n")
    print(chunk.page_content)


--- Chunk 1 ---

LangChain is a framework designed to help developers build applications powered by language models. It provides tools and abstractions for working with large language models like GPT-4.

One of the core challenges in using LLMs is managing the context window — models can only take in a limited number of tokens. LangChain solves this by offering tools like document loaders, text splitters, and memory modules.

Text splitters are particularly useful when working with long documents

--- Chunk 2 ---

 splitters, and memory modules.

Text splitters are particularly useful when working with long documents. They break down text into chunks that fit within the LLM’s context window, optionally with overlapping content for continuity.

LangChain also supports retrieval-based question answering (RAG), where external documents are converted into embeddings, stored in a vector database, and queried based on similarity to a user's question.

With LangChain, developers can buil
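
Token-based splitting can be pictured as a sliding window over the token list. The sketch below uses a toy tokenizer (each word together with its leading whitespace) as a stand-in for tiktoken, so the counts will not match a real model's; `toy_tokenize` and `split_by_tokens` are hypothetical names.

```python
import re


def toy_tokenize(text):
    """Stand-in tokenizer: every token is a word plus its leading whitespace."""
    return re.findall(r"\s*\S+", text)


def split_by_tokens(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(split_by_tokens("a b c d e f g", chunk_size=4, chunk_overlap=1))
# ['a b c d', ' d e f g']
```

Because tokens carry their leading whitespace, a chunk can begin with a space, as Chunk 2 does in the output above.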

### MarkdownTextSplitter
`MarkdownTextSplitter` is designed for Markdown documents: it prefers to split at headings and other structural boundaries before falling back to paragraphs and lines.

In [47]:
from langchain.text_splitter import MarkdownTextSplitter
from langchain_community.document_loaders import TextLoader

# Load the markdown file as a single document
loader = TextLoader("markdown_example.md")
documents = loader.load()

# Initialize Markdown splitter
splitter = MarkdownTextSplitter(chunk_size=300, chunk_overlap=0)

# Split the document
chunks = splitter.split_documents(documents)

# Print chunks
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk.page_content)


--- Chunk 1 ---
# LangChain Overview

LangChain is a framework to build LLM-powered applications.

## Features

- Document loading
- Text splitting
- Embedding and vector search

## Use Cases

### Question Answering

Build Q&A bots over your documents.

### Summarization

Summarize long texts using LLMs.

--- Chunk 2 ---
# Conclusion

LangChain simplifies building LLM pipelines.
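
One way to picture heading-aware splitting is a regex that cuts the text in front of every heading line. This is a rough sketch, not the library's code: `split_markdown_sections` is a hypothetical helper, it ignores `chunk_size`, and it would also cut at `#` lines inside fenced code blocks. `MarkdownTextSplitter` itself appears to reuse the recursive strategy with Markdown-specific separators, which is why several small sections can end up in one chunk, as in Chunk 1 above.

```python
import re


def split_markdown_sections(markdown):
    """Cut a Markdown string in front of every ATX heading (# to ######)."""
    # (?m) makes ^ match at each line start; the lookahead keeps the heading
    # together with the section that follows it. Requires Python 3.7+.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


doc = (
    "# LangChain Overview\n\nIntro text.\n\n"
    "## Features\n\n- Document loading\n- Text splitting\n\n"
    "# Conclusion\n\nDone.\n"
)
for section in split_markdown_sections(doc):
    print(repr(section.splitlines()[0]))
# '# LangChain Overview'
# '## Features'
# '# Conclusion'
```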


### SpacyTextSplitter
- Splits text at linguistic boundaries (sentences) rather than at fixed character counts
- Preserves natural language flow, which helps the LLM interpret each chunk
- Useful for tasks such as summarization, translation, and Q&A

In [48]:
from langchain.text_splitter import SpacyTextSplitter

# Sample long text
text = """
LangChain is an open-source framework to build applications using large language models.
It provides utilities for text loading, splitting, retrieval, and memory.
This helps developers create powerful AI tools with minimal effort.
"""

# Create a SpacyTextSplitter
splitter = SpacyTextSplitter(chunk_size=30, chunk_overlap=5)

# Split the text (returns list of strings)
chunks = splitter.split_text(text)

# Show the chunks
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk)

Created a chunk of size 90, which is longer than the specified 30
Created a chunk of size 74, which is longer than the specified 30



--- Chunk 1 ---
LangChain is an open-source framework to build applications using large language models.

--- Chunk 2 ---
It provides utilities for text loading, splitting, retrieval, and memory.

--- Chunk 3 ---
This helps developers create powerful AI tools with minimal effort.
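
The "Created a chunk of size ..." warnings in the output above are expected: the splitter does not break inside a sentence, and every sentence in the sample is longer than `chunk_size=30`. The snippet below checks this with a naive regex in place of spaCy's sentence segmentation (an assumption made so the example runs without spaCy), so the reported lengths can differ slightly from the ones in the warnings.

```python
import re

text = """
LangChain is an open-source framework to build applications using large language models.
It provides utilities for text loading, splitting, retrieval, and memory.
This helps developers create powerful AI tools with minimal effort.
"""

# Naive stand-in for spaCy: split after '.', '!' or '?' followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

chunk_size = 30
for sentence in sentences:
    status = "over limit" if len(sentence) > chunk_size else "fits"
    print(f"{len(sentence):3d} chars ({status}): {sentence[:40]}...")
```

Raising `chunk_size` above the longest sentence length removes the warnings and lets short neighboring sentences share a chunk.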


### HTMLHeaderTextSplitter

- Splits HTML content based on logical sections
- Retains structure for summarization, indexing, or retrieval
- Great for crawling and analyzing web pages with LLMs

In [49]:
from langchain.text_splitter import HTMLHeaderTextSplitter

# Define the header tags you want to split on
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3")
]

# Sample HTML as string
html_string = """
<html>
  <body>
    <h1>LangChain Overview</h1>
    <p>LangChain helps build LLM-based applications.</p>
    <h2>Key Features</h2>
    <p>It supports document loading, splitting, and retrieval.</p>
    <h2>Use Cases</h2>
    <h3>Chatbots</h3>
    <p>Build context-aware chatbots over custom data.</p>
    <h3>Summarization</h3>
    <p>Generate summaries from long documents.</p>
  </body>
</html>
"""

# Initialize the HTML splitter
splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Split the HTML into LangChain documents
docs = splitter.split_text(html_string)

# Print results
for i, doc in enumerate(docs):
    print(f"\n--- Chunk {i+1} ---")
    print(f"Metadata: {doc.metadata}")
    print(doc.page_content)



--- Chunk 1 ---
Metadata: {'Header 1': 'LangChain Overview'}
LangChain Overview

--- Chunk 2 ---
Metadata: {'Header 1': 'LangChain Overview'}
LangChain helps build LLM-based applications.

--- Chunk 3 ---
Metadata: {'Header 1': 'LangChain Overview', 'Header 2': 'Key Features'}
Key Features

--- Chunk 4 ---
Metadata: {'Header 1': 'LangChain Overview', 'Header 2': 'Key Features'}
It supports document loading, splitting, and retrieval.

--- Chunk 5 ---
Metadata: {'Header 1': 'LangChain Overview', 'Header 2': 'Use Cases'}
Use Cases

--- Chunk 6 ---
Metadata: {'Header 1': 'LangChain Overview', 'Header 2': 'Use Cases', 'Header 3': 'Chatbots'}
Chatbots

--- Chunk 7 ---
Metadata: {'Header 1': 'LangChain Overview', 'Header 2': 'Use Cases', 'Header 3': 'Chatbots'}
Build context-aware chatbots over custom data.

--- Chunk 8 ---
Metadata: {'Header 1': 'LangChain Overview', 'Header 2': 'Use Cases', 'Header 3': 'Summarization'}
Summarization

--- Chunk 9 ---
Metadata: {'Header 1': 'LangChain Overvi
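
The metadata in this output follows a simple rule: each piece of text inherits the most recent header at every level, and a new header clears the levels below it. The sketch below reproduces that rule with the standard library's `html.parser`. `HeaderTracker` is a hypothetical class, not LangChain's implementation; it assumes `headers_to_split_on` is ordered from highest to lowest level and that headers contain plain text only.

```python
from html.parser import HTMLParser


class HeaderTracker(HTMLParser):
    """Collect (metadata, text) pairs, tagging text with the headers above it."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.labels = dict(headers_to_split_on)  # tag name -> metadata key
        self.levels = list(self.labels)          # tags from highest to lowest level
        self.current = {}                        # metadata key -> header text
        self.open_tag = None
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.open_tag = tag

    def handle_endtag(self, tag):
        self.open_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.open_tag in self.labels:
            # A new header replaces its own level and clears all deeper levels.
            level = self.levels.index(self.open_tag)
            for tag in self.levels[level:]:
                self.current.pop(self.labels[tag], None)
            self.current[self.labels[self.open_tag]] = text
        self.chunks.append((dict(self.current), text))


tracker = HeaderTracker([("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")])
tracker.feed("<h1>A</h1><p>x</p><h2>B</h2><p>y</p><h2>C</h2><h3>D</h3><p>z</p>")
tracker.close()
for metadata, text in tracker.chunks:
    print(metadata, text)
```

As in the output above, header text is emitted as its own chunk, followed by the content that belongs to it.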

### RecursiveJsonSplitter
`RecursiveJsonSplitter` splits nested JSON data (a Python dict) into smaller dicts of roughly `max_chunk_size` characters each.

In [50]:
from langchain.text_splitter import RecursiveJsonSplitter
# Example JSON data (Python dict format)
json_data = {
    "title": "LangChain Overview",
    "sections": [
        {
            "heading": "Introduction",
            "content": "LangChain helps you build LLM-powered apps."
        },
        {
            "heading": "Features",
            "content": "It supports document loaders, text splitters, and retrieval."
        },
        {
            "heading": "Conclusion",
            "content": "LangChain simplifies working with language models."
        }
    ]
}

# Create splitter
splitter = RecursiveJsonSplitter(max_chunk_size=100) 
# Split JSON into chunks (returns list of dicts)
chunks = splitter.split_json(json_data)

# Show chunks
for i, chunk in enumerate(chunks):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk)


--- Chunk 1 ---
{'title': 'LangChain Overview', 'sections': [{'heading': 'Introduction', 'content': 'LangChain helps you build LLM-powered apps.'}, {'heading': 'Features', 'content': 'It supports document loaders, text splitters, and retrieval.'}, {'heading': 'Conclusion', 'content': 'LangChain simplifies working with language models.'}]}
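
The single chunk above is larger than `max_chunk_size=100` because the splitter only descends into dicts, so the `sections` list is treated as one indivisible value. `split_json` accepts a `convert_lists=True` argument that first rewrites lists as index-keyed dicts so that they can be split as well. The sketch below illustrates that idea in plain Python. All function names are hypothetical, the packing is a simplified greedy strategy rather than the library's algorithm, and empty dicts are dropped.

```python
import copy
import json


def lists_to_dicts(data):
    """Rewrite every list as a dict keyed by item index ("0", "1", ...)."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(i): lists_to_dicts(value) for i, value in enumerate(data)}
    return data


def leaves(data, path=()):
    """Yield (path, value) for every non-dict value in a nested dict."""
    for key, value in data.items():
        if isinstance(value, dict):
            yield from leaves(value, path + (key,))
        else:
            yield path + (key,), value


def set_path(target, path, value):
    """Store value in target under the nested keys given by path."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json_sketch(data, max_chunk_size):
    """Greedily pack leaves into dicts, closing a chunk when its JSON text would exceed the limit."""
    chunks, current = [], {}
    for path, value in leaves(lists_to_dicts(data)):
        trial = copy.deepcopy(current)
        set_path(trial, path, value)
        if current and len(json.dumps(trial)) > max_chunk_size:
            chunks.append(current)  # close the full chunk
            current = {}
            set_path(current, path, value)
        else:
            current = trial
    if current:
        chunks.append(current)
    return chunks


data = {"title": "Overview", "sections": [{"heading": "Intro"}, {"heading": "Features"}]}
for chunk in split_json_sketch(data, max_chunk_size=40):
    print(chunk)
```

Each chunk keeps the full key path to its values, so the chunks can be merged back into the converted structure.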
