Mastering Text Splitting in Langchain

Retrieval-Augmented Generation (RAG) has emerged as a powerful technique for enhancing the accuracy and relevance of AI-generated responses. At the heart of RAG lies a crucial step: “text splitting”. This process involves breaking down large documents into smaller, manageable chunks that can be efficiently processed and retrieved.

Langchain, a popular framework for developing applications with large language models (LLMs), offers a variety of text splitting techniques. 

CharacterTextSplitter: The Simple Solution
The CharacterTextSplitter is the most basic text splitting technique in Langchain. It splits text on a single separator string and then merges the resulting pieces into chunks of up to chunk_size characters, making it suitable for simple, uniformly structured text.



Why split at all? LLMs have a limited context window, measured in tokens. Even if the context window were infinite, more input tokens would mean higher costs, and money is not infinite.

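With CharacterTextSplitter, text is split only where the separator occurs. The example in this section uses a double newline (“\n\n”) as the separator, so splits happen only at paragraph breaks.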
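To make the split-then-merge idea concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name split_on_separator is hypothetical, and chunk overlap is left out to keep the logic visible.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters.

    Illustrative sketch only (hypothetical helper, no chunk overlap).
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Still fits, or this is the first piece of a new chunk
            current = candidate
        else:
            # Adding the piece would overflow: close the chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph, a bit longer."
chunks = split_on_separator(sample, chunk_size=40)
for chunk in chunks:
    print(repr(chunk))
```

Note that a piece with no separator inside it is never cut, so a single long paragraph can still exceed chunk_size. That limitation is what motivates the recursive approach.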

Recursive
Rather than relying on a single separator, the recursive approach uses a list of separators. It tries each separator in order, splitting any piece that is still larger than chunk_size with the next separator, until the pieces fit.

In [1]:
from langchain.text_splitter import CharacterTextSplitter

text = "Your long document text here..."

splitter = CharacterTextSplitter(
    separator="\n\n",  # split on blank lines to avoid cutting paragraphs in half
    chunk_size=1000,
    chunk_overlap=200,
)

chunks = splitter.split_text(text)  # use split_documents() to split Document objects instead

In [2]:
chunks

['Your long document text here...']

In [1]:
import fitz  # PyMuPDF

# Function to extract text and links from a PDF
def load_pdf_with_links(pdf_file):
    text_parts = []
    pdf_links = []

    # Open the PDF; the context manager closes the file when done
    with fitz.open(pdf_file) as pdf_document:
        # Iterate through each page
        for page_num in range(pdf_document.page_count):
            page = pdf_document.load_page(page_num)  # Load individual page
            text_parts.append(page.get_text("text"))  # Extract plain text

            # Collect external links (entries that carry a 'uri' key)
            for link in page.get_links():
                if 'uri' in link:
                    pdf_links.append(link['uri'])

    return "".join(text_parts), pdf_links

# Usage
pdf_file_path = r"C:\Users\abdullah\projects\Langchain\Generative-AI\LangChain\RAG\data\My_CV_2.pdf"
text, links = load_pdf_with_links(pdf_file_path)

print("Extracted Text:")
print(text)

print("\nExtracted Links:")
for link in links:
    print(link)


Extracted Text:
Md Abdullah Al Hasib
MACHINE LEARNING ENGINEER
+8801741813559 |
Mail |
Abdullah Al Hasib |
Al-Hasib |
YouTube |
Medium
Gopalpur,Tangail, Bangladesh
EXPERIENCE
Machine Learning Engineer - remote (KBY-AI)
SEP 2023 – MAY 2024
• Design, develop, and optimize computer vision algorithms and models through GPU Servers
• Train, fine-tune, and optimize deep learning models and neural networks for monitoring the cow farms such
as incident detection, animal counting, feed lane detection etc
• Integrate computer vision models into larger software systems or applications and deploy them into
production environments. Engage in continuous learning and professional development activities.
Jr. ML Engineer - remote (Namespace IT)
JAN 2023 – FEB 2024
• Content Writer at aionlinecourse in Machine Learning, Deep Leanring, Computer Vision related articles
• Explore the updated technologies in the field of AI. Making projects in different domains of AI.
PERSONAL PROJECTS
License Plate Detecti

RecursiveCharacterTextSplitter: The Versatile Powerhouse

The RecursiveCharacterTextSplitter is Langchain’s most versatile text splitter. It attempts to split text on a list of characters in order, falling back to the next option if the resulting chunks are too large.

When to use:
- As a default choice for most general-purpose text splitting tasks.
- When dealing with various document types with different structures.
- To maintain semantic coherence in splits as much as possible.

This splitter tries to split on double newlines first, then single newlines, spaces, and finally individual characters if necessary.

In [3]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "Your long document text here..."

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

chunks = splitter.split_text(text)
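The fallback behaviour described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's code: recursive_split is a hypothetical helper, it ignores chunk_overlap, and it drops separators at split points, which the real splitter may handle differently.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` into pieces of at most `chunk_size` characters where possible.

    Tries separators in order; a piece that is still too large is split again
    with the remaining separators. Hypothetical sketch, no overlap handling.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to try; return the oversized piece
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # Part is too large on its own: fall back to the next separator
            chunks.extend(recursive_split(part, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta gamma\n\ndelta epsilon zeta eta theta\n\niota"
pieces = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=20)
print(pieces)
```

In this sample the first and last paragraphs fit as they are, while the middle paragraph is too long and gets split again on spaces.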

TokenTextSplitter: Precision for Token-Based Models

The TokenTextSplitter splits text by token count rather than character count, which is particularly useful when working with models that enforce token limits, such as GPT-3.

When to use:
- When working with token-sensitive models
- To ensure that chunks fit within model token limits
- For more precise control over input size for language models

In [4]:
from langchain.text_splitter import TokenTextSplitter

text = "Your long document text here..."

splitter = TokenTextSplitter(
    encoding_name="cl100k_base",  # OpenAI's encoding
    chunk_size=100,
    chunk_overlap=20
)

chunks = splitter.split_text(text)

In [5]:
chunks

['Your long document text here...']
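The sample text above is far shorter than 100 tokens, so it comes back as a single chunk. To show how chunk_size and chunk_overlap interact on longer input, here is a rough sketch of a sliding token window. It assumes whitespace-separated words as a stand-in for real tokens; an actual tokenizer such as the cl100k_base encoding produces different token boundaries, and split_by_tokens is a hypothetical helper.

```python
def split_by_tokens(text, chunk_size=100, chunk_overlap=20):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap.

    ASSUMPTION: whitespace words stand in for model tokens (not a real BPE tokenizer).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


sample = " ".join(f"t{i}" for i in range(10))
windows = split_by_tokens(sample, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(window)
```

Each window repeats the last token of the previous one, which is the role chunk_overlap plays: context at a chunk boundary is not lost entirely.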

MarkdownHeaderTextSplitter: Structure-Aware Splitting for Markdown
The MarkdownHeaderTextSplitter is specially designed to handle Markdown documents, respecting the header hierarchy and document structure.

When to use:
- Specifically for Markdown documents
- To maintain the logical structure of documentation or articles
- When header-based organization is crucial for your RAG application 

This splitter creates chunks based on Markdown headers, preserving the document’s hierarchical structure.

In [6]:
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = """
# Title
## Section 1
Content of section 1
## Section 2
Content of section 2
### Subsection 2.1
Content of subsection 2.1
"""

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
chunks = splitter.split_text(markdown_text)
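The cell above does not show its output, so here is a rough pure-Python sketch of the idea behind header-based splitting: each chunk carries the headers that were active when its content appeared. The helper split_markdown_by_headers and the dictionary layout are assumptions for illustration, not the library's actual return type.

```python
def split_markdown_by_headers(markdown_text, headers_to_split_on):
    """Group body lines under the headers that precede them (illustrative sketch)."""
    # Check longer markers first so "###" is not mistaken for "#"
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    chunks, active, body = [], {}, []

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"metadata": dict(active), "content": content})
        body.clear()

    for line in markdown_text.splitlines():
        stripped = line.strip()
        for marker, label in markers:
            if stripped.startswith(marker + " "):
                flush()
                # A new header replaces headers at the same or a deeper level
                for other_marker, other_label in headers_to_split_on:
                    if len(other_marker) >= len(marker):
                        active.pop(other_label, None)
                active[label] = stripped[len(marker):].strip()
                break
        else:
            body.append(line)  # not a header line: part of the current section
    flush()
    return chunks


sample_md = """
# Title
## Section 1
Content of section 1
## Section 2
Content of section 2
### Subsection 2.1
Content of subsection 2.1
"""
sections = split_markdown_by_headers(
    sample_md, [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
)
for section in sections:
    print(section)
```

Keeping the header trail as metadata lets a retriever show where a chunk came from, or filter by section, without repeating the headers inside the chunk text.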

PythonCodeTextSplitter: Tailored for Code Splitting

The PythonCodeTextSplitter is designed specifically for splitting Python source code, respecting function and class boundaries.

When to use:
- When working with Python codebases
- For code documentation or analysis tasks
- To maintain the integrity of code structures in your splits

In [7]:
from langchain.text_splitter import PythonCodeTextSplitter

python_code = """
def function1():
    print("Hello, World!")

class MyClass:
    def __init__(self):
        self.value = 42

    def method1(self):
        return self.value
"""

splitter = PythonCodeTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

chunks = splitter.split_text(python_code)

chunks

['def function1():\n    print("Hello, World!")',
 'class MyClass:\n    def __init__(self):\n        self.value = 42',
 'def method1(self):\n        return self.value']

HTMLSectionSplitter: Structured Splitting for Web Content
The HTMLSectionSplitter is tailored for HTML documents. It splits on the header tags you specify and records each section's header in the chunk metadata, preserving the structure and hierarchy of the page.

When to use:
- For processing web pages or HTML-formatted documents
- When HTML structure is important for your RAG task
- To extract content while preserving HTML context

In [12]:
#!pip install langchain-text-splitters
from langchain_text_splitters import HTMLSectionSplitter

html_text = """
<html>
<body>
<h1>Main Title</h1>
<p>This is a paragraph.</p>
<div>
    <h2>Subsection</h2>
    <p>Another paragraph.</p>
</div>
</body>
</html>
"""

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
splitter = HTMLSectionSplitter(
    headers_to_split_on=headers_to_split_on,
    chunk_size=100,
    chunk_overlap=20
)

chunks = splitter.split_text(html_text)

chunks

[Document(metadata={'Header 1': 'Main Title'}, page_content='Main Title \n This is a paragraph.'),
 Document(metadata={'Header 2': 'Subsection'}, page_content='Subsection \n Another paragraph.')]

SpacyTextSplitter: Advanced Linguistic Splitting
The SpacyTextSplitter leverages the spaCy library for more advanced, language-aware text splitting.

When to use:
- For highly accurate, linguistically informed splitting
- When working with multiple languages
- For tasks requiring deep language understanding

Remember to install spaCy and the appropriate language model before using this splitter.


In [4]:
from langchain.text_splitter import SpacyTextSplitter

text = "Your long document text here. It can be in various languages. SpaCy will handle the linguistic nuances."

splitter = SpacyTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

chunks = splitter.split_text(text)
chunks

['Your long document text here.\n\nIt can be in various languages.',
 'SpaCy will handle the linguistic nuances.']

LatexTextSplitter: Specialized for LaTeX Documents
The LatexTextSplitter is designed to handle LaTeX documents, respecting the unique structure and commands of LaTeX syntax.

When to use:
- Specifically for LaTeX documents
- In academic or scientific document processing
- To maintain LaTeX formatting and structure in splits

This splitter attempts to maintain the integrity of LaTeX commands and environments while creating chunks.

In [19]:
from langchain.text_splitter import LatexTextSplitter

latex_text = r"""
\documentclass{article}
\begin{document}
\section{Introduction}
This is the introduction.
\section{Methodology}
This is the methodology section.
\end{document}
"""

splitter = LatexTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

chunks = splitter.split_text(latex_text)
chunks

['\\documentclass{article}\n\\begin{document}\n\\section{Introduction}\nThis is the',
 'is the introduction.\n\\section{Methodology}\nThis is the methodology section.\n\\end{document}']

Choosing the right text splitter is crucial for optimizing your RAG pipeline in Langchain. Each splitter offers unique advantages suited to different document types and use cases. The RecursiveCharacterTextSplitter serves as an excellent default choice for general purposes, while specialized splitters like MarkdownHeaderTextSplitter or PythonCodeTextSplitter offer tailored solutions for specific document formats.



Code
Because programming languages are structured differently from plain text, code can be split along the syntax of the specific language.

In [1]:
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

PYTHON_CODE = """
def add(a, b):
    return a + b

class Calculator:
    def __init__(self):
        self.result = 0

    def add(self, value):
        self.result += value
        return self.result

    def subtract(self, value):
        self.result -= value
        return self.result

# Call the function
def main():
    calc = Calculator()
    print(calc.add(5))
    print(calc.subtract(2))

if __name__ == "__main__":
    main()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=100, chunk_overlap=0
)

python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(metadata={}, page_content='def add(a, b):\n    return a + b'),
 Document(metadata={}, page_content='class Calculator:\n    def __init__(self):\n        self.result = 0'),
 Document(metadata={}, page_content='def add(self, value):\n        self.result += value\n        return self.result'),
 Document(metadata={}, page_content='def subtract(self, value):\n        self.result -= value\n        return self.result'),
 Document(metadata={}, page_content='# Call the function'),
 Document(metadata={}, page_content='def main():\n    calc = Calculator()\n    print(calc.add(5))\n    print(calc.subtract(2))'),
 Document(metadata={}, page_content='if __name__ == "__main__":\n    main()')]
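Language-aware splitting can be thought of as the recursive approach with separators that match the language's syntax. For Python that presumably means trying class and function boundaries before blank lines and spaces. The sketch below shows only that first step, using a regular expression. The helper split_python_top_level is hypothetical and far simpler than the library's splitter.

```python
import re


def split_python_top_level(code):
    """Split source code before each top-level `class` or `def` line.

    Illustrative sketch: a lookahead keeps the keyword with the block that
    follows it; indented methods are not split because they do not start a line
    with `def ` at column 0.
    """
    parts = re.split(r"\n(?=class |def )", code)
    return [part.strip() for part in parts if part.strip()]


SAMPLE_CODE = """
def add(a, b):
    return a + b

class Calculator:
    def reset(self):
        self.result = 0
"""
blocks = split_python_top_level(SAMPLE_CODE)
for block in blocks:
    print(repr(block))
```

A size-limited splitter would then apply finer separators (blank lines, single newlines, spaces) to any block that is still larger than chunk_size, which is why the methods of Calculator end up in separate chunks in the output above.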

JSON
A nested JSON object can be split so that the parent keys are repeated in every related chunk. Long lists can be converted into dictionaries keyed by index so that they can be split as well. Let’s look at an example.

In [2]:
from langchain_text_splitters import RecursiveJsonSplitter

# Example JSON object
json_data = {
    "company": {
        "name": "TechCorp",
        "location": {
            "city": "Metropolis",
            "state": "NY"
        },
        "departments": [
            {
                "name": "Research",
                "employees": [
                    {"name": "Alice", "age": 30, "role": "Scientist"},
                    {"name": "Bob", "age": 25, "role": "Technician"}
                ]
            },
            {
                "name": "Development",
                "employees": [
                    {"name": "Charlie", "age": 35, "role": "Engineer"},
                    {"name": "David", "age": 28, "role": "Developer"}
                ]
            }
        ]
    },
    "financials": {
        "year": 2023,
        "revenue": 1000000,
        "expenses": 750000
    }
}


# Initialize the RecursiveJsonSplitter with maximum and minimum chunk sizes
splitter = RecursiveJsonSplitter(max_chunk_size=200, min_chunk_size=20)

# Split the JSON object
chunks = splitter.split_text(json_data, convert_lists=True)

# Process the chunks as needed
for chunk in chunks:
    print(len(chunk))
    print(chunk)

84
{"company": {"name": "TechCorp", "location": {"city": "Metropolis", "state": "NY"}}}
183
{"company": {"departments": {"0": {"name": "Research", "employees": {"0": {"name": "Alice", "age": 30, "role": "Scientist"}, "1": {"name": "Bob", "age": 25, "role": "Technician"}}}}}}
188
{"company": {"departments": {"1": {"name": "Development", "employees": {"0": {"name": "Charlie", "age": 35, "role": "Engineer"}, "1": {"name": "David", "age": 28, "role": "Developer"}}}}}}
70
{"financials": {"year": 2023, "revenue": 1000000, "expenses": 750000}}
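In the output above, the departments and employees lists appear as dictionaries with keys "0" and "1". A minimal sketch of that list-to-dictionary conversion might look like this. The helper lists_to_dicts is hypothetical and only mirrors the shape seen in the output, not necessarily how the library implements convert_lists.

```python
import json


def lists_to_dicts(data):
    """Recursively replace every list with a dict keyed by the item's index.

    Illustrative sketch: keys are strings, matching how they appear in JSON output.
    """
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(index): lists_to_dicts(item) for index, item in enumerate(data)}
    return data  # scalars are returned unchanged


nested = {"departments": [{"name": "Research", "employees": ["Alice", "Bob"]}]}
converted = lists_to_dicts(nested)
print(json.dumps(converted))
```

Once every list is a dictionary, the splitter can cut between items and still repeat the path of parent keys in each chunk, which a raw list would not allow.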
