# **LangChain Text Splitter**

In this notebook we will go through the different types of text splitters in `LangChain`. Each splitter has its own advantages and its own way of splitting the data.

Before diving into the splitters, we need to understand why there are so many of them. By the way, splitting is also called `chunking`.

#### First, let's look at what LangChain says about `Text splitters`:

> Document splitting is often a crucial preprocessing step for many applications. It involves breaking down large texts into smaller, manageable chunks. This process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems. There are several strategies for splitting documents, each with its own advantages.


There are many types of loaders. Each loader has its own limitations and is used for a different type of document.

For example, `PyPDFLoader` loads PDFs and `WebBaseLoader` loads web pages. In the same way, different splitters are used for different tasks, depending on what we need to split.

Some of the common splitters in LangChain are:


1. `CharacterTextSplitter` - Splits text on a single separator and merges the pieces into chunks, with the chunk length measured in characters.
   
2. `RecursiveCharacterTextSplitter` - Splits text hierarchically by breaking it down into smaller sections while respecting delimiters like paragraphs or sentences.
   
3. `MarkdownTextSplitter` - Splits text based on Markdown formatting, keeping logical structures like headings and lists intact.

4. `TokenTextSplitter` - Splits text by token count, often using tokenizer models to match LLM tokenization logic.

5. `RecursiveCharacterTextSplitter.from_language` - Splits code files using separators suited to a given programming language.

6. `PythonCodeTextSplitter` - Specializes in splitting Python code along boundaries such as function and class definitions.

7. `HTMLHeaderTextSplitter` - Splits HTML documents at header tags (such as `h1`, `h2`, `h3`) and records the headers each chunk falls under as metadata.

8.  `XMLSplitter` - Splits XML files while preserving the structure of nodes and tags.

9.  `CustomSplitter` - A user-defined subclass of `TextSplitter` for specialized splitting requirements.

10. `NLTKTextSplitter` - Splits text into sentences using the NLTK package.


----

We are using `CharacterTextSplitter` to split the documents.

### Parameters:
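- separator: The string at which the text is split (default: "\n\n").
- chunk_size: Target maximum size of each chunk, measured in characters by default. A piece that contains no separator can still exceed it.
- chunk_overlap: Number of overlapping characters between consecutive chunks.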
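To make the three parameters concrete, here is a minimal pure-Python sketch of the idea: split on the separator, then greedily pack the pieces into chunks of at most `chunk_size` characters, carrying trailing pieces into the next chunk as overlap. The function name `simple_character_split` is hypothetical, and this is a simplification for illustration, not LangChain's implementation.

```python
def simple_character_split(text, separator="\n", chunk_size=20, chunk_overlap=0):
    """Greedy illustration of separator / chunk_size / chunk_overlap."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def joined_len(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until what we carry over fits in
            # chunk_overlap and still leaves room for the new piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "aaaa\nbbbb\ncccc\ndddd"
no_overlap = simple_character_split(sample, chunk_size=9, chunk_overlap=0)
with_overlap = simple_character_split(sample, chunk_size=9, chunk_overlap=4)
print(no_overlap)    # ['aaaa\nbbbb', 'cccc\ndddd']
print(with_overlap)  # ['aaaa\nbbbb', 'bbbb\ncccc', 'cccc\ndddd']
```

With overlap, the last piece of one chunk is repeated at the start of the next, which helps keep context across chunk boundaries.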

In [1]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('./Data/Central_Limit_Theorem.pdf')

docs = loader.load()

In [3]:
len(docs)

7

In [4]:
docs[0]

Document(metadata={'source': './Data/Central_Limit_Theorem.pdf', 'page': 0}, page_content='Bernoulli distribution is a probability distribution that models a binary outcome, where the \noutcome can be either success (represented by the value 1) or failure (represented by the \nvalue 0). The Bernoulli distribution is named after the Swiss mathematician Jacob Bernoulli, \nwho first introduced it in the late 1600s.\nThe Bernoulli distribution is characterized by a single parameter, which is the probability of \nsuccess, denoted by p. The probability mass function (PMF) of the Bernoulli distribution is:\nThe Bernoulli distribution is commonly used in machine learning for modelling \nbinary outcomes, such as whether a customer will make a purchase or not, \nwhether an email is spam or not, or whether a patient will have a certain disease \nor not.\nBernoulli Distribution\n27 March 2023 16:06\n   Session on Central Limit Theorem Page 1    ')

In [5]:
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(separator="\n",  # the text is cut only at newline characters
                                 chunk_size=120,
                                 chunk_overlap=50) 


In most cases, when we work with splitters we come across these two methods:

#### 1. split_documents
#### 2. split_text

#### Let's see the difference between them:

The primary difference between split_text and split_documents in LangChain lies in what they process and the type of output they generate:

1. `split_text`
   - Purpose: Splits a single string of text into smaller chunks.
   - Input: A single string (e.g., raw text or the contents of a file).
   - Output: A list of strings, where each string is a chunk of the original text.
   - Use Case: When you want to split raw text directly into manageable pieces, often for tasks like tokenization or summarization.
2. `split_documents`
   - Purpose: Splits multiple documents into chunks.
   - Input: A list of `Document` objects, each with a `page_content` string and a `metadata` dictionary.
   - Output: A list of `Document` objects, where each one is a chunk of an input document and carries that document's metadata.
   - Use Case: When working with structured data (e.g., multiple documents) and you want to preserve metadata (e.g., document source, page number).



Use split_text for simple, unstructured text, and use split_documents for more complex workflows where you need to manage multiple documents along with their metadata.
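The relationship between the two methods can be sketched in plain Python. The `Doc` class below is a hypothetical stand-in for LangChain's `Document`, and the line-based `split_text` is only a toy rule chosen for this example. The point is that `split_documents` applies the text splitting to each document and copies the metadata onto every chunk.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text(text):
    # Toy splitting rule for the illustration: one chunk per non-empty line.
    return [line for line in text.split("\n") if line.strip()]


def split_documents(documents):
    # Split each document's text and copy its metadata onto every chunk.
    return [
        Doc(page_content=chunk, metadata=dict(doc.metadata))
        for doc in documents
        for chunk in split_text(doc.page_content)
    ]


docs_in = [Doc("first line\nsecond line", {"source": "a.txt", "page": 0})]
text_chunks = split_text(docs_in[0].page_content)   # list of strings
doc_chunks = split_documents(docs_in)                # list of Doc objects
print(text_chunks)
print(doc_chunks)
```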

In [6]:
chunks = splitter.split_documents(docs)

In [34]:
splitter = CharacterTextSplitter(separator="\n",  # the text is cut only at newline characters
                                 chunk_size=120,
                                 chunk_overlap=50) 

text = """Document splitting is often a crucial preprocessing step for many applications.
It involves breaking down large texts into smaller, manageable chunks. 
This process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems. 
There are several strategies for splitting documents, each with its own advantages."""

chunk = splitter.split_text(text)




In [35]:
chunk

['Document splitting is often a crucial preprocessing step for many applications.',
 'It involves breaking down large texts into smaller, manageable chunks.',
 'This process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems.',
 'There are several strategies for splitting documents, each with its own advantages.']
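Notice that the third chunk is much longer than `chunk_size=120`. `CharacterTextSplitter` only cuts at the separator, so a line with no `"\n"` inside it cannot be made smaller. The small check below reproduces this with plain string operations (no LangChain needed); `sample_text` is the same text without the trailing spaces.

```python
sample_text = (
    "Document splitting is often a crucial preprocessing step for many applications.\n"
    "It involves breaking down large texts into smaller, manageable chunks.\n"
    "This process offers several benefits, such as ensuring consistent processing of "
    "varying document lengths, overcoming input size limitations of models, and "
    "improving the quality of text representations used in retrieval systems.\n"
    "There are several strategies for splitting documents, each with its own advantages."
)
chunk_size = 120

# Splitting on the separator is the only place a cut can happen.
pieces = sample_text.split("\n")
oversized = [p for p in pieces if len(p) > chunk_size]

for p in pieces:
    print(len(p), "over the limit" if len(p) > chunk_size else "ok")
```

The first two lines are also not merged into one chunk, because joined together they would be longer than 120 characters.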

We are using `RecursiveCharacterTextSplitter` to split the documents.

In [19]:
from langchain_community.document_loaders.word_document import Docx2txtLoader

loader = Docx2txtLoader('./Data/RAG_Types_Table.docx')
docs = loader.load()

In [23]:
docs

[Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='RAG Types: Advantages, Disadvantages, Use Cases, and Additional Information\n\nRAG Type\n\nAdvantages\n\nDisadvantages\n\nWhen to Use\n\nAdditional Information\n\nHybrid RAG\n\n- High accuracy by combining multiple information sources\n- Handles diverse types of data (structured, unstructured) well\n- Robust in challenging scenarios\n\n- Complexity in implementation\n- Higher computational resources required\n- Increased latency\n\n- When accuracy is paramount, and there are multiple data types\n\nCombines retrieval-based techniques (like search engines or databases) and generation-based techniques (like GPT-based models) to provide comprehensive responses.\n\nGenerative RAG\n\n- Provides flexible and creative responses\n- Can generate human-like content\n- Capable of handling open-domain questions\n\n- Risk of generating hallucinated information\n- Requires more extensive training data\n\n- For open-ended or c

In [31]:
from langchain_text_splitters.character import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", " "],
    chunk_size=100,
    chunk_overlap=0
)

chunks = splitter.split_documents(docs)

In [28]:
chunks

[Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='RAG Types: Advantages, Disadvantages, Use Cases, and Additional Information\n\nRAG Type\n\nAdvantages'),
 Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='Disadvantages\n\nWhen to Use\n\nAdditional Information\n\nHybrid RAG'),
 Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='- High accuracy by combining multiple information sources'),
 Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='- Handles diverse types of data (structured, unstructured) well\n- Robust in challenging scenarios'),
 Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='- Complexity in implementation\n- Higher computational resources required\n- Increased latency'),
 Document(metadata={'source': './Data/RAG_Types_Table.docx'}, page_content='- When accuracy is paramount, and there are multiple data types'),
 Document(metadata={'source': './Data/R

In [32]:
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", " "],
    chunk_size=100,
    chunk_overlap=0
)

chunk = splitter.split_text(text)

In [33]:
chunk

['Document splitting is often a crucial preprocessing step for many applications.',
 'It involves breaking down large texts into smaller, manageable chunks.',
 'This process offers several benefits, such as ensuring consistent processing of varying document',
 'lengths, overcoming input size limitations of models, and improving the quality of text',
 'representations used in retrieval systems',
 '.',
 'There are several strategies for splitting documents, each with its own advantages.']
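The recursive idea behind this output can be sketched in a few lines: split on the first separator, keep the pieces that fit, and retry the pieces that are still too long with the next separator in the list. This is a simplified pure-Python illustration with hypothetical helper names (`merge`, `recursive_split`); it drops the separators and ignores overlap, so its output will not match LangChain's exactly.

```python
def merge(pieces, sep, chunk_size):
    """Greedily join adjacent pieces with `sep` while staying within chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, separators, chunk_size):
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to try: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, small = [], []
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) <= chunk_size:
            small.append(piece)
        else:
            # Flush the pieces that fit, then retry the long one with finer separators.
            chunks.extend(merge(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, rest, chunk_size))
    chunks.extend(merge(small, sep, chunk_size))
    return chunks


demo = "Intro paragraph.\n\nA second paragraph that is long enough to need word-level splitting."
demo_chunks = recursive_split(demo, ["\n\n", "\n", " "], chunk_size=30)
print(demo_chunks)
```

The short first paragraph survives as one chunk, while the long second paragraph falls through to the `" "` separator and is rebuilt word by word up to the size limit.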

We are using `TokenTextSplitter` to split the documents

In [37]:
!pip install tiktoken -q

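There are different ways to split our data by token count.

We can use `TokenTextSplitter` directly, or build a `CharacterTextSplitter` or `RecursiveCharacterTextSplitter` with `from_tiktoken_encoder` so that `chunk_size` is measured in tokens instead of characters.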

In [38]:
from langchain_text_splitters.base import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=0)

chunks = splitter.split_text(text)

In [39]:
chunks

['Document splitting is often a crucial preprocessing step for many applications.\nIt involves breaking down large texts into smaller, manageable chunks. \nThis process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems. \nThere are several strategies for splitting documents, each with its own advantages.']

In [40]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)

texts = text_splitter.split_text(text)

In [44]:
len(texts)

1

In [43]:
len(texts[0])

464

In [45]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)

texts = text_splitter.split_text(text)

In [47]:
len(texts[0])

464
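In all three cases the 464-character text comes back as a single chunk, because `chunk_size=100` now counts tokens, not characters, and the text is shorter than 100 tokens. Here is a minimal sketch of token-window splitting. It uses whitespace-separated words as a stand-in tokenizer, which is an assumption made for this example (tiktoken uses byte-pair encoding), and `split_by_tokens` is a hypothetical name.

```python
def split_by_tokens(text, chunk_size, chunk_overlap=0):
    """Slide a window of `chunk_size` tokens over the text."""
    tokens = text.split()  # stand-in tokenizer: whitespace-separated words
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


sentence = "one two three four five six seven"
token_chunks = split_by_tokens(sentence, chunk_size=3, chunk_overlap=1)
single_chunk = split_by_tokens(sentence, chunk_size=100)
print(token_chunks)   # ['one two three', 'three four five', 'five six seven']
print(single_chunk)   # ['one two three four five six seven']
```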

We are using `MarkdownTextSplitter` to split Markdown files.

In [48]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader('./Data/mark.md')

docs = loader.load()

In [50]:
docs[0]

Document(metadata={'source': './Data/mark.md'}, page_content='Hello\n\nHello\n\nHello\n\nHello\n\nThis is my book\n\nThis is my book\n\nThis is my book.\n\nThis is my book\n\nHi\n\n__Hello__\n\nhello\n\nhello\n\nHello\n\nHi\n\nhi\n\nHello\n\nHello\n\nhi\\\n\nBooks\n\nThis is my book')

In [51]:
from langchain_text_splitters.markdown import MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size=50, chunk_overlap=0)
chunks = splitter.split_documents(docs)

In [52]:
chunks

[Document(metadata={'source': './Data/mark.md'}, page_content='Hello\n\nHello\n\nHello\n\nHello\n\nThis is my book'),
 Document(metadata={'source': './Data/mark.md'}, page_content='This is my book\n\nThis is my book.'),
 Document(metadata={'source': './Data/mark.md'}, page_content='This is my book\n\nHi\n\n__Hello__\n\nhello\n\nhello'),
 Document(metadata={'source': './Data/mark.md'}, page_content='Hello\n\nHi\n\nhi\n\nHello\n\nHello\n\nhi\\\n\nBooks'),
 Document(metadata={'source': './Data/mark.md'}, page_content='This is my book')]

We will split source code using `RecursiveCharacterTextSplitter.from_language`.

`RecursiveCharacterTextSplitter` includes pre-built lists of separators that are useful for splitting text in a specific programming language.

Supported languages are stored in the `langchain_text_splitters.Language` enum. They include:

"cpp",
"go",
"java",
"kotlin",
"js",
"ts",
"php",
"proto",
"python",
"rst",
"ruby",
"rust",
"scala",
"swift",
"markdown",
"latex",
"html",
"sol",
"csharp",
"cobol",
"c",
"lua",
"perl",
"haskell"

In [54]:
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

JS_CODE = """
function helloWorld() {
  console.log("Hello, World!");
}

// Call the function
helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)


In [55]:
js_docs = js_splitter.create_documents([JS_CODE])
js_docs

[Document(metadata={}, page_content='function helloWorld() {\n  console.log("Hello, World!");\n}'),
 Document(metadata={}, page_content='// Call the function\nhelloWorld();')]

In [58]:
html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🦜️🔗 LangChain</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: darkblue;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🦜️🔗 LangChain</h1>
            <p>⚡ Building applications with LLMs through composability ⚡</p>
        </div>
        <div>
            As an open-source project in a rapidly developing field, we are extremely open to contributions.
        </div>
    </body>
</html>
"""

html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)
html_docs = html_splitter.create_documents([html_text])
html_docs

[Document(metadata={}, page_content='<!DOCTYPE html>\n<html>'),
 Document(metadata={}, page_content='<head>\n        <title>🦜️🔗 LangChain</title>'),
 Document(metadata={}, page_content='<style>\n            body {\n                font-family: Aria'),
 Document(metadata={}, page_content='l, sans-serif;\n            }\n            h1 {'),
 Document(metadata={}, page_content='color: darkblue;\n            }\n        </style>\n    </head'),
 Document(metadata={}, page_content='>'),
 Document(metadata={}, page_content='<body>'),
 Document(metadata={}, page_content='<div>\n            <h1>🦜️🔗 LangChain</h1>'),
 Document(metadata={}, page_content='<p>⚡ Building applications with LLMs through composability ⚡'),
 Document(metadata={}, page_content='</p>\n        </div>'),
 Document(metadata={}, page_content='<div>\n            As an open-source project in a rapidly dev'),
 Document(metadata={}, page_content='eloping field, we are extremely open to contributions.'),
 Document(metadata={}, page_

In [59]:
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(metadata={}, page_content='def hello_world():\n    print("Hello, World!")'),
 Document(metadata={}, page_content='# Call the function\nhello_world()')]
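The language-aware part is just the choice of separators: for code, it makes sense to cut right before a new definition starts. The sketch below shows that idea with a regular expression. The keyword list is a small illustrative subset picked for this example, not the exact separator list LangChain ships, and `split_before_keywords` is a hypothetical helper.

```python
import re


def split_before_keywords(source, keywords=("class ", "def ")):
    """Cut `source` just before each line that starts with one of `keywords`."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    # Zero-width match at a line start that is followed by a keyword, so the
    # keyword stays attached to the chunk it introduces.
    boundary = re.compile(r"^(?=(?:" + alternatives + r"))", re.MULTILINE)
    return [part.strip() for part in boundary.split(source) if part.strip()]


sample_code = (
    "import os\n"
    "\n"
    "def first():\n"
    "    return 1\n"
    "\n"
    "class Second:\n"
    "    def method(self):\n"
    "        return 2\n"
)
code_chunks = split_before_keywords(sample_code)
for c in code_chunks:
    print(repr(c))
```

The indented `def method` does not start a new chunk, because only keywords at the start of a line count as boundaries here.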

We are using `HTMLHeaderTextSplitter` to split HTML code

In [62]:
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader('./Data/langchain.html')

docs = loader.load()

In [63]:
docs

[Document(metadata={'source': './Data/langchain.html', 'title': 'BSHTMLLoader | 🦜️🔗 LangChain'}, page_content='\n\nBSHTMLLoader | 🦜️🔗 LangChain\n\n\n\n\n\n\nSkip to main contentIntegrationsAPI ReferenceMoreContributingPeopleError referenceLangSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.1💬SearchKProvidersAnthropicAWSGoogleHugging FaceMicrosoftOpenAIMoreProvidersAcreomActiveloop Deep LakeAerospikeAI21 LabsAimAINetworkAirbyteAirtableAlchemyAleph AlphaAlibaba CloudAnalyticDBAnnoyAnthropicAnyscaleApache Software FoundationApache DorisApifyAppleArangoDBArceeArcGISArgillaArizeArthurArxivAscendAskNewsAssemblyAIAstra DBAtlasAwaDBAWSAZLyricsBAAIBagelBagelDBBaichuanBaiduBananaBasetenBeamBeautiful SoupBibTeXBiliBiliBittensorBlackboardbookend.aiBoxBrave SearchBreebs (Open Knowledge)BrowserbaseBrowserlessByteDanceCassandraCerebrasCerebriumAIChaindeskChromaClarifaiClearMLClickHouseClickUpCloudflareClovaCnosDBCogniSwitchCohereCollege ConfidentialCometConfident AIConfluenceConneryContextCo

In [69]:
from langchain_text_splitters.html import HTMLHeaderTextSplitter

splitter = HTMLHeaderTextSplitter(headers_to_split_on=[
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
])

In [70]:
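# Note: this calls `html_splitter` (the Language.HTML recursive splitter, chunk_size=60), not the
# HTMLHeaderTextSplitter above, so the output is 60-character fragments, not header sections.
html_header_splits = html_splitter.split_documents(docs)
html_header_splits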

[Document(metadata={'source': './Data/langchain.html', 'title': 'BSHTMLLoader | 🦜️🔗 LangChain'}, page_content='BSHTMLLoader | 🦜️🔗 LangChain\n\n\n\n\n\n\nSkip to main contentInt'),
 Document(metadata={'source': './Data/langchain.html', 'title': 'BSHTMLLoader | 🦜️🔗 LangChain'}, page_content='egrationsAPI ReferenceMoreContributingPeopleError referenceL'),
 Document(metadata={'source': './Data/langchain.html', 'title': 'BSHTMLLoader | 🦜️🔗 LangChain'}, page_content='angSmithLangGraphLangChain HubLangChain JS/TSv0.3v0.3v0.2v0.'),
 Document(metadata={'source': './Data/langchain.html', 'title': 'BSHTMLLoader | 🦜️🔗 LangChain'}, page_content='1💬SearchKProvidersAnthropicAWSGoogleHugging FaceMicrosoftOpe'),
 Document(metadata={'source': './Data/langchain.html', 'title': 'BSHTMLLoader | 🦜️🔗 LangChain'}, page_content='nAIMoreProvidersAcreomActiveloop Deep LakeAerospikeAI21 Labs'),
 Document(metadata={'source': './Data/langchain.html', 'title': 'BSHTMLLoader | 🦜️🔗 LangChain'}, page_content='AimAINetw
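Header-based splitting needs the raw HTML with its tags, while the loader output above has already been reduced to plain text. To show what splitting by headers means, here is a small sketch using only Python's standard `html.parser`. `HeaderSectionParser` is a hypothetical helper written for this example, not LangChain's `HTMLHeaderTextSplitter`; it groups body text under the most recent headers and keeps those headers as metadata.

```python
from html.parser import HTMLParser


class HeaderSectionParser(HTMLParser):
    """Hypothetical helper: groups text under the most recent headers."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}        # header metadata for the current section
        self.in_header = None   # tag name while we are inside a header element
        self.buffer = []        # body text collected for the current section
        self.sections = []      # list of (metadata, text) pairs

    def _flush(self):
        text = " ".join(self.buffer).strip()
        if text:
            self.sections.append((dict(self.active), text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self._flush()
            # A new header clears its own level and every deeper level.
            for name in self.header_names:
                if name >= tag:
                    self.active.pop(self.header_names[name], None)
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.in_header:
            self.active[self.header_names[self.in_header]] = data
        else:
            self.buffer.append(data)

    def close(self):
        super().close()
        self._flush()


sample_html = (
    "<h1>Guide</h1><p>Intro text.</p>"
    "<h2>Setup</h2><p>Install it.</p>"
    "<h2>Usage</h2><p>Run it.</p>"
)
parser = HeaderSectionParser([("h1", "Header 1"), ("h2", "Header 2")])
parser.feed(sample_html)
parser.close()
sections = parser.sections
for metadata, text in sections:
    print(metadata, "->", text)
```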

We can create our own custom splitter by subclassing LangChain's `TextSplitter`.

In [73]:
from langchain_text_splitters import TextSplitter  # same package as the other splitters in this notebook

class CustomSplitter(TextSplitter):
    def split_text(self, text):
        # Your custom splitting logic
        return text.split(";")

splitter = CustomSplitter(chunk_size=100, chunk_overlap=10)
chunks = splitter.split_text("Part1;Part2;Part3")
print(chunks)

['Part1', 'Part2', 'Part3']


In [74]:
text

'Document splitting is often a crucial preprocessing step for many applications.\nIt involves breaking down large texts into smaller, manageable chunks. \nThis process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems. \nThere are several strategies for splitting documents, each with its own advantages.'

We can also split text into sentences using `NLTKTextSplitter`, which relies on the `nltk` package.

In [76]:
from langchain_text_splitters.nltk import NLTKTextSplitter

splitter = NLTKTextSplitter()

chunk = splitter.split_text(text)

In [77]:
chunk

['Document splitting is often a crucial preprocessing step for many applications.\n\nIt involves breaking down large texts into smaller, manageable chunks.\n\nThis process offers several benefits, such as ensuring consistent processing of varying document lengths, overcoming input size limitations of models, and improving the quality of text representations used in retrieval systems.\n\nThere are several strategies for splitting documents, each with its own advantages.']
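The output is a single chunk because the sentences are first detected and then merged back together with `"\n\n"` until the chunk size limit is reached, and this text is far below the limit. A rough sketch of that behaviour is shown below. The regular expression is a naive stand-in for NLTK's sentence tokenizer, and both the `sentence_split` name and the default of 4000 characters are assumptions made for this illustration.

```python
import re


def sentence_split(text, separator="\n\n", chunk_size=4000):
    # Naive sentence boundary: '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if current and len(candidate) > chunk_size:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


demo_text = "First sentence. Second one!\nThird?"
one_chunk = sentence_split(demo_text)
small_chunks = sentence_split(demo_text, chunk_size=20)
print(one_chunk)     # ['First sentence.\n\nSecond one!\n\nThird?']
print(small_chunks)  # ['First sentence.', 'Second one!\n\nThird?']
```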

We can split Python code specifically using `PythonCodeTextSplitter`.

In [78]:
from langchain_community.document_loaders.python import PythonLoader

loader = PythonLoader('./Data/sms.py')
docs = loader.load()

In [79]:
from langchain_text_splitters.python import PythonCodeTextSplitter

splitter = PythonCodeTextSplitter()

chunks = splitter.split_documents(docs)

In [83]:
chunks

[Document(metadata={'source': './Data/sms.py'}, page_content='# Step 1: Dictionary to store student data\nstudents = {}\n\n\n# Step 2: Function to add a new student\ndef add_student():\n    name = input("Enter student\'s name: ").title()\n    age = int(input(f"Enter {name}\'s age: "))\n    marks = float(input(f"Enter {name}\'s marks: "))\n\n    # Add student to the dictionary\n    students[name] = {"age": age, "marks": marks}\n    print(f"Student {name} added successfully!")\n\n\n# Step 3: Function to update student marks\ndef update_marks():\n    name = input("Enter the student\'s name to update marks: ").title()\n\n    if name in students:\n        new_marks = float(input(f"Enter new marks for {name}: "))\n        students[name]["marks"] = new_marks\n        print(f"{name}\'s marks updated to {new_marks}!")\n    else:\n        print(f"Student {name} not found!")\n\n\n# Step 4: Function to display a student\'s details\ndef display_student():\n    name = input("Enter the student\'s nam