# LangChain Text Splitters – A Hands-On Guide

This repository demonstrates how **LangChain Text Splitters** are used to divide large documents into smaller, meaningful chunks for better embeddings and retrieval in **Generative AI pipelines**.

## Why Text Splitting?

LLMs (such as GPT-5, Gemini, or Claude) accept a limited amount of input (the context window, measured in tokens). Large documents can exceed that limit, causing:
- Lost context or truncated responses  
- Poor embedding quality  
- Slow and inefficient retrieval  

Text splitters solve this by **dividing text into overlapping, context-preserving chunks** before embedding or retrieval.
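
To make the idea of overlapping chunks concrete, here is a minimal sketch in plain Python. It is not how LangChain implements its splitters; the function name `chunk_with_overlap` and the sample values are assumptions chosen for illustration. Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share `chunk_overlap` characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "The world must be made safe for democracy."
chunks = chunk_with_overlap(sample, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

The last 5 characters of each chunk reappear at the start of the next one, which is what lets a retriever recover context that would otherwise be cut at a chunk boundary.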





# 1. TokenTextSplitter 
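### It splits the text by token count (computed with `tiktoken`), so `chunk_size` and `chunk_overlap` are measured in tokens, not characters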

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import TokenTextSplitter

loader = TextLoader("./speech.txt")
docs = loader.load()

text_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=25)

texts = text_splitter.split_documents(docs)
texts



[Document(metadata={'source': './speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight'),
 Document(metadata={'source': './speech.txt'}, page_content=' those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the pr
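
The window arithmetic behind a token-based split can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `token_windows` is a hypothetical helper, the 230-token document length is made up, and the sketch assumes each window advances by `chunk_size - chunk_overlap` tokens. Because windows are cut purely by token index, a chunk can end mid-sentence, as the first chunk above does.

```python
def token_windows(n_tokens: int, chunk_size: int, chunk_overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) token index pairs for overlapping windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            break  # the final window reached the last token
        start += chunk_size - chunk_overlap
    return windows


# Hypothetical 230-token document, with the chunk_size and chunk_overlap used above.
windows = token_windows(230, chunk_size=100, chunk_overlap=25)
print(windows)  # [(0, 100), (75, 175), (150, 230)]
```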

# 2. CharacterTextSplitter
### It splits the text on a single separator and merges the pieces into chunks of up to `chunk_size` characters

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=75,
    chunk_overlap=15,
)

# The text is split only at newlines, because "\n" is the separator.
# A piece longer than chunk_size that contains no newline is returned as is,
# and the splitter logs a "Created a chunk of size ..." warning for it.
texts = text_splitter.split_documents(docs)
texts

Created a chunk of size 470, which is longer than the specified 75
Created a chunk of size 347, which is longer than the specified 75
Created a chunk of size 668, which is longer than the specified 75
Created a chunk of size 982, which is longer than the specified 75
Created a chunk of size 789, which is longer than the specified 75


[Document(metadata={'source': './speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'),
 Document(metadata={'source': './speech.txt'}, page_content='Just because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.'),
 Document(metadata={'source': './speech.txt'}, page_c
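
The warnings above follow from splitting on one separator only. A simplified sketch of that behaviour in plain Python, assuming a greedy merge and ignoring `chunk_overlap` (the helper `split_on_separator` is hypothetical and is not LangChain's implementation):

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily merge neighbours up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep growing the current chunk
            continue
        if current:
            chunks.append(current)
        current = piece  # may itself exceed chunk_size; it is kept whole
    if current:
        chunks.append(current)
    return chunks


text = "A short line.\nAnother short line.\n" + "x" * 120 + "\nTail."
chunks = split_on_separator(text, separator="\n", chunk_size=75)
print([len(c) for c in chunks])  # [33, 120, 5]
```

The 120-character line has no newline inside it, so there is nowhere to cut it and it comes back longer than `chunk_size`, just like the paragraphs of the speech.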

# 3. RecursiveCharacterTextSplitter

### It tries a list of separators in order, recursively re-splitting any piece that is still larger than `chunk_size`

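from langchain_text_splitters import RecursiveCharacterTextSplitter

# Separators are tried in order: paragraph breaks first,
# then whitespace that follows "." or ",".
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", r"(?<=[.,])\s+"],
    chunk_size=100,
    chunk_overlap=20,
    is_separator_regex=True,
    # Caveat: with keep_separator=False, merged pieces are re-joined with the raw
    # separator string, so the regex pattern itself appears inside page_content
    # (visible in the output below). Leaving keep_separator at its default (True)
    # should avoid this.
    keep_separator=False,
)

texts = text_splitter.split_documents(docs)
texts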

[Document(metadata={'source': './speech.txt'}, page_content='The world must be made safe for democracy.'),
 Document(metadata={'source': './speech.txt'}, page_content='Its peace must be planted upon the tested foundations of political liberty.'),
 Document(metadata={'source': './speech.txt'}, page_content='We have no selfish ends to serve.(?<=[.,])\\s+We desire no conquest,(?<=[.,])\\s+no dominion.'),
 Document(metadata={'source': './speech.txt'}, page_content='no dominion.(?<=[.,])\\s+We seek no indemnities for ourselves,'),
 Document(metadata={'source': './speech.txt'}, page_content='no material compensation for the sacrifices we shall freely make.'),
 Document(metadata={'source': './speech.txt'}, page_content='We are but one of the champions of the rights of mankind.'),
 Document(metadata={'source': './speech.txt'}, page_content='We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'),
 Document(metadata={'source': '.
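
The recursive idea can be sketched in plain Python as follows. This is a simplification, not LangChain's implementation: the helper `recursive_split` is hypothetical, it ignores `chunk_overlap`, and it re-joins merged pieces with a single space instead of the original separator.

```python
import re


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split with the first regex separator; recurse with the rest on oversized pieces."""
    if len(text) <= chunk_size or not separators:
        return [text] if text.strip() else []
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in re.split(first, text):
        pieces.extend(recursive_split(part, rest, chunk_size))
    # Greedily merge adjacent pieces (joined by one space) while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + " " + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = (
    "We have no selfish ends to serve. We desire no conquest, no dominion.\n\n"
    "We seek no indemnities for ourselves, "
    "no material compensation for the sacrifices we shall freely make."
)
chunks = recursive_split(text, [r"\n\n", r"(?<=[.,])\s+"], chunk_size=70)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits within 70 characters and stays whole; the second does not, so it is split again at the whitespace after the comma.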

# 4. MarkdownHeaderTextSplitter 
### It is used when the text follows a structured Markdown format, such as articles, documentation, or hierarchical notes (with `#`, `##`, `###` headers, etc.)

from langchain_text_splitters import MarkdownHeaderTextSplitter

markdown_text = """
# LangChain TextSplitters
Text splitters are among the most important elements of a LangChain pipeline; they divide the data into meaningful chunks.
## Types of TextSplitters
1. Length-Based
2. Text Structured Based
3. Document Structured Based
4. Semantic Meaning Based
### Reasons to Split Documents
1. Handling non-uniform document lengths
2. Overcoming model limitations
3. Improving representation quality
4. Optimizing Computational Resources
"""

# Each tuple maps a Markdown header prefix to the metadata key stored on the chunk.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)

chunks = splitter.split_text(markdown_text)
chunks

[Document(metadata={'Header 1': 'LangChain TextSplitters'}, page_content='Text splitters are among the most important elements of a LangChain pipeline; they divide the data into meaningful chunks.'),
 Document(metadata={'Header 1': 'LangChain TextSplitters', 'Header 2': 'Types of TextSplitters'}, page_content='1. Length-Based\n2. Text Structured Based\n3. Document Structured Based\n4. Semantic Meaning Based'),
 Document(metadata={'Header 1': 'LangChain TextSplitters', 'Header 2': 'Types of TextSplitters', 'Header 3': 'Reasons to Split Documents'}, page_content='1. Handling non-uniform document lengths\n2. Overcoming model limitations\n3. Improving representation quality\n4. Optimizing Computational Resources')]
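
How the header metadata is carried from one chunk to the next can be sketched with a small header stack. This is an illustrative approximation in plain Python; the helper `split_markdown_by_headers` is hypothetical, and it does not handle fenced code blocks or other edge cases a real splitter must deal with.

```python
def split_markdown_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Group body lines under the header path (levels 1..max_level) they belong to."""
    sections = []
    path = {}   # header level -> title currently in effect, e.g. {1: "Guide", 2: "Setup"}
    body = []   # body lines collected since the last header

    def flush():
        if body:
            metadata = {f"Header {level}": title for level, title in sorted(path.items())}
            sections.append({"metadata": metadata, "page_content": "\n".join(body)})
            body.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        hashes = len(stripped) - len(stripped.lstrip("#"))
        if 1 <= hashes <= max_level and stripped[hashes:hashes + 1] == " ":
            flush()
            # A new header replaces its own level and discards deeper ones.
            path = {level: title for level, title in path.items() if level < hashes}
            path[hashes] = stripped[hashes:].strip()
        elif stripped:
            body.append(stripped)
    flush()
    return sections


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n### Options\nUse defaults.\n## Usage\nRun it."
sections = split_markdown_by_headers(doc)
for section in sections:
    print(section)
```

Note how the `## Usage` header drops the deeper `### Options` entry from the metadata while keeping `# Guide`.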

# 5. HTMLHeaderTextSplitter
### It is a "structure-aware" text splitter that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk


### Choosing the Right Splitter
- Use HTMLHeaderTextSplitter when: You need to split an HTML document based on its header hierarchy and maintain metadata about the headers.
- Use HTMLSectionSplitter when: You need to split the document into larger, more general sections, possibly based on custom tags or font sizes.
- Use HTMLSemanticPreservingSplitter when: You need to split the document into chunks while preserving semantic elements like tables and lists, ensuring that they are not split and that their context is maintained.


from langchain_text_splitters import (
    HTMLHeaderTextSplitter,
    HTMLSectionSplitter,
    HTMLSemanticPreservingSplitter,
)

# HTMLHeaderTextSplitter splits HTML on header tags (<h1>, <h2>, <h3>, ...)
# and attaches the headers relevant to each chunk as metadata.
url = "https://python.langchain.com/docs/how_to/split_html/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)

# For a local file, use html_splitter.split_text_from_file(<path_to_file>).
# The whole page is fetched, so text from inline scripts and navigation
# can end up in the first chunks (as in the output below).
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits

[Document(metadata={}, page_content='!function(){function t(t){document.documentElement.setAttribute("data-theme",t)}var e=function(){try{return new URLSearchParams(window.location.search).get("docusaurus-theme")}catch(t){}}()||function(){try{return window.localStorage.getItem("theme")}catch(t){}}();null!==e?t(e):window.matchMedia("(prefers-color-scheme: dark)").matches?t("dark"):(window.matchMedia("(prefers-color-scheme: light)").matches,t("light"))}(),function(){try{const n=new URLSearchParams(window.location.search).entries();for(var[t,e]of n)if(t.startsWith("docusaurus-data-")){var a=t.replace("docusaurus-data-","data-");document.documentElement.setAttribute(a,e)}}catch(t){}}(),document.documentElement.setAttribute("data-announcement-bar-initially-dismissed",function(){try{return"true"===localStorage.getItem("docusaurus.announcement.dismiss")}catch(t){}return!1}())  \nSkip to main content  \n⚠️ THESE DOCS ARE OUTDATED.  \nVisit the new v1.0 docs  \nIntegrations  \nAPI Reference  \n
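
The first chunk above is mostly inline JavaScript from the page. One way to see how header-based splitting can skip such content is the following sketch built on Python's standard `html.parser`. It is only an illustration; the class `HeaderSplitter` and the sample HTML are assumptions, and this is not how HTMLHeaderTextSplitter is implemented.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect visible text under the most recent h1-h4 headers, ignoring script/style."""

    HEADERS = {"h1": "Header 1", "h2": "Header 2", "h3": "Header 3", "h4": "Header 4"}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.sections = []       # finished (metadata, text) pairs
        self.header_path = {}    # header tag -> title currently in effect
        self.buffer = []         # visible text collected for the current section
        self.open_header = None  # header tag currently being read, if any
        self.skip_depth = 0      # > 0 while inside <script> or <style>

    def finish_section(self):
        if self.buffer:
            metadata = {self.HEADERS[t]: title for t, title in sorted(self.header_path.items())}
            self.sections.append((metadata, " ".join(self.buffer)))
            self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.HEADERS:
            self.finish_section()
            # Drop this level and anything deeper before recording the new header.
            self.header_path = {t: v for t, v in self.header_path.items() if t < tag}
            self.header_path[tag] = ""
            self.open_header = tag

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == self.open_header:
            self.open_header = None

    def handle_data(self, data):
        data = data.strip()
        if not data or self.skip_depth:
            return
        if self.open_header:
            self.header_path[self.open_header] += data
        else:
            self.buffer.append(data)


html = """<html><head><script>var theme = "dark";</script></head><body>
<h1>Guide</h1><p>Intro text.</p>
<h2>Setup</h2><p>Install it.</p>
</body></html>"""

parser = HeaderSplitter()
parser.feed(html)
parser.close()
parser.finish_section()  # emit the text collected after the last header
sections = parser.sections
for metadata, text in sections:
    print(metadata, text)
```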

# 6. RecursiveJsonSplitter
### It is used to split nested JSON data into smaller, manageable chunks while preserving hierarchical structure and context.

from langchain_text_splitters import RecursiveJsonSplitter

# Sample nested JSON data
university_data = {
    "university": "Sheffield Hallam University",
    "location": "Sheffield, United Kingdom",
    "departments": [
        {
            "name": "Computing and AI",
            "courses": [
                {
                    "course_name": "MSc Big Data Analytics",
                    "modules": [
                        {"name": "Machine Learning", "credits": 20},
                        {"name": "Deep Learning", "credits": 20},
                        {"name": "Big Data Engineering", "credits": 20}
                    ]
                },
                {
                    "course_name": "MSc Cyber Security",
                    "modules": [
                        {"name": "Network Security", "credits": 20},
                        {"name": "Cryptography", "credits": 20}
                    ]
                }
            ]
        },
        {
            "name": "Business and Management",
            "courses": [
                {
                    "course_name": "MBA International Business",
                    "modules": [
                        {"name": "Global Strategy", "credits": 20},
                        {"name": "Leadership", "credits": 20}
                    ]
                }
            ]
        }
    ]
}

# Create RecursiveJsonSplitter instance
splitter = RecursiveJsonSplitter(max_chunk_size=200)

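# Split JSON into smaller structured chunks.
# convert_lists=True turns lists into dicts keyed by index ('0', '1', ...),
# which is why those keys appear in the output and lists can be split too.
chunks = splitter.split_json(university_data, convert_lists=True)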

# Print results
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")


Chunk 1:
{'university': 'Sheffield Hallam University', 'location': 'Sheffield, United Kingdom'}

Chunk 2:
{'departments': {'0': {'name': 'Computing and AI'}}}

Chunk 3:
{'departments': {'0': {'courses': {'0': {'course_name': 'MSc Big Data Analytics'}}}}}

Chunk 4:
{'departments': {'0': {'courses': {'0': {'modules': {'0': {'name': 'Machine Learning', 'credits': 20}, '1': {'name': 'Deep Learning', 'credits': 20}}}}}}}

Chunk 5:
{'departments': {'0': {'courses': {'0': {'modules': {'2': {'name': 'Big Data Engineering', 'credits': 20}}}}}}}

Chunk 6:
{'departments': {'0': {'courses': {'1': {'course_name': 'MSc Cyber Security', 'modules': {'0': {'name': 'Network Security', 'credits': 20}, '1': {'name': 'Cryptography', 'credits': 20}}}}}}}

Chunk 7:
{'departments': {'1': {'name': 'Business and Management'}}}

Chunk 8:
{'departments': {'1': {'courses': {'0': {'course_name': 'MBA International Business', 'modules': {'0': {'name': 'Global Strategy', 'credits': 20}, '1': {'name': 'Leadership', 'c
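
The way each chunk keeps its full key path can be sketched in plain Python. The helpers below (`lists_to_dicts`, `set_path`, `split_json`) and the sample data are hypothetical and only approximate the behaviour shown above; sizes are measured on the serialized JSON, and the nesting overhead of the path is ignored, so a chunk can end up slightly over the limit.

```python
import json


def lists_to_dicts(value):
    """Turn lists into {"0": ..., "1": ...} dicts so every level has addressable keys."""
    if isinstance(value, list):
        return {str(i): lists_to_dicts(v) for i, v in enumerate(value)}
    if isinstance(value, dict):
        return {k: lists_to_dicts(v) for k, v in value.items()}
    return value


def set_path(target: dict, path: list, value) -> None:
    """Write value into target under the nested keys in path, creating parents as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data: dict, max_chunk_size: int) -> list:
    chunks = [{}]

    def walk(node, path):
        if not isinstance(node, dict):
            set_path(chunks[-1], path, node)  # a leaf too large to fit is stored as is
            return
        for key, value in node.items():
            used = len(json.dumps(chunks[-1]))
            needed = len(json.dumps({key: value}))
            if needed < max_chunk_size - used:
                set_path(chunks[-1], path + [key], value)  # fits: keep the subtree whole
            else:
                if chunks[-1]:
                    chunks.append({})  # start a fresh chunk before descending
                walk(value, path + [key])

    walk(lists_to_dicts(data), [])
    return [chunk for chunk in chunks if chunk]


data = {
    "name": "Library",
    "shelves": [
        {"topic": "AI", "books": ["Deep Learning", "Pattern Recognition"]},
        {"topic": "Security", "books": ["Applied Cryptography"]},
    ],
}
chunks = split_json(data, max_chunk_size=60)
for chunk in chunks:
    print(chunk)
```

Every chunk repeats the keys leading down to its content, so each one can be interpreted on its own without the rest of the document.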