**Reference Link:** [RAG Systems Essentials (Analytics Vidhya)](https://courses.analyticsvidhya.com/courses/take/rag-systems-essentials/lessons/60148017-hands-on-deep-dive-into-rag-evaluation-metrics-generator-metrics-i)

# Exploring Document Splitters and Chunkers in LangChain

## Install OpenAI, HuggingFace and LangChain dependencies

In [1]:
%run setup.ipynb

## Document Splitting and Chunking

After loading documents into LangChain, you often need to transform them before your application can use them well. The most common transformation is splitting a long document into smaller chunks that fit within your model's context window. LangChain provides several built-in document transformers for splitting, combining, filtering, and otherwise manipulating documents.

#### Process of Document Splitting:
1. **Splitting into Chunks:**
   - Break down the text into small, semantically meaningful units (typically sentences).
   
2. **Combining Chunks:**
   - Assemble these smaller units into larger chunks until they reach a predefined size. This size is determined by a specific measurement function.

3. **Creating Overlapping Chunks:**
   - Once the maximum size is reached, finalize the chunk as an independent text piece.
   - Begin a new chunk, incorporating some overlap with the previous chunk to maintain textual context.

This approach ensures that semantically related text pieces are kept together, which is crucial for maintaining the meaning and continuity of the document.
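
The three steps above can be sketched in plain Python. This is a minimal illustration and not LangChain's implementation: the function name `split_merge_overlap`, the regex sentence rule, and the parameters `max_chars` and `overlap_units` are assumptions made for the example, and `length_fn` stands in for the "measurement function" mentioned in step 2.

```python
import re

def split_merge_overlap(text, max_chars=200, overlap_units=1, length_fn=len):
    """Split text into sentences, then merge them into overlapping chunks."""
    # Step 1: split into small units (a naive regex sentence split, assumed here).
    units = [u.strip() for u in re.split(r"(?<=[.!?])\s+", text) if u.strip()]

    chunks, current = [], []
    for unit in units:
        candidate = " ".join(current + [unit])
        # Step 2: keep merging units until the measured size would exceed the limit.
        if current and length_fn(candidate) > max_chars:
            # Step 3: finalize the chunk, then seed the next chunk with the
            # last `overlap_units` units of the chunk that was just closed.
            chunks.append(" ".join(current))
            current = current[-overlap_units:] if overlap_units > 0 else []
        current.append(unit)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(split_merge_overlap("A b c. D e f. G h i. J k l.", max_chars=14))
```

Because the overlap is carried over before the next unit is added, a chunk can exceed `max_chars` when the overlap plus a single long sentence is larger than the limit. A stricter version would need to shrink the overlap in that case.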


### Fixed Length Chunking

Fixed-length chunking is a straightforward approach where text is divided into **predefined chunk sizes**, typically based on tokens or characters.

**Mechanism**

•	Splits text into **fixed-size chunks** without considering meaning.

•	Chunks are defined by **a set number of tokens or characters**.

**Ideal Use Cases**

•	Suitable for **structured documents**, FAQs, or cases where **processing speed** is a priority.

**Pros**

•	**Simplicity:** Easy to implement without complex algorithms.

•	**Uniformity:** Consistent chunk sizes simplify indexing and retrieval.

**Cons**

•	**Context Loss:** May split sentences or ideas, leading to incomplete information.

•	**Relevance Issues:** Important details may span multiple chunks, reducing retrieval effectiveness.

**Implementation Strategies**

•	Select a **chunk size** that balances **context retention and efficiency**.

•	Use **overlapping windows** to preserve continuity and reduce context loss.

In [2]:
sample_text = """
Introduction

Data Science is an interdisciplinary field that uses scientific methods, processes,
 algorithms, and systems to extract knowledge and insights from structured and 
 unstructured data. It draws from statistics, computer science, machine learning, 
 and various data analysis techniques to discover patterns, make predictions, and 
 derive actionable insights.

Data Science can be applied across many industries, including healthcare, finance,
 marketing, and education, where it helps organizations make data-driven decisions,
  optimize processes, and understand customer behaviors.

Overview of Big Data

Big data refers to large, diverse sets of information that grow at ever-increasing 
rates. It encompasses the volume of information, the velocity or speed at which it 
is created and collected, and the variety or scope of the data points being 
covered.

Data Science Methods

There are several important methods used in Data Science:

1. Regression Analysis
2. Classification
3. Clustering
4. Neural Networks

Challenges in Data Science

- Data Quality: Poor data quality can lead to incorrect conclusions.
- Data Privacy: Ensuring the privacy of sensitive information.
- Scalability: Handling massive datasets efficiently.

Conclusion

Data Science continues to be a driving force in many industries, offering insights 
that can lead to better decisions and optimized outcomes. It remains an evolving 
field that incorporates the latest technological advancements.
"""


In [3]:
def fixed_size_chunk(text, max_words=100):
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Applying Fixed-Size Chunking
fixed_chunks = fixed_size_chunk(sample_text)
for i, chunk in enumerate(fixed_chunks, start=1):
    print(f"chunk {i}")
    print(chunk, '\n---\n')

chunk 1
Introduction Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It draws from statistics, computer science, machine learning, and various data analysis techniques to discover patterns, make predictions, and derive actionable insights. Data Science can be applied across many industries, including healthcare, finance, marketing, and education, where it helps organizations make data-driven decisions, optimize processes, and understand customer behaviors. Overview of Big Data Big data refers to large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity 
---

chunk 2
or speed at which it is created and collected, and the variety or scope of the data points being covered. Data Science Methods There are several important methods used in Data Science: 1. Regression Analysis 2. Classification 
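
The example above counts words. Since fixed-length chunking is also commonly defined in characters, here is a minimal character-based sketch. The name `fixed_size_char_chunk` and the `max_chars` parameter are assumptions for illustration only.

```python
def fixed_size_char_chunk(text, max_chars=200):
    """Split text into consecutive slices of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Collapse whitespace runs so newlines and indentation do not count toward the size.
    normalized = " ".join(text.split())
    return [normalized[i:i + max_chars] for i in range(0, len(normalized), max_chars)]

for piece in fixed_size_char_chunk("Fixed-length chunking ignores meaning.", max_chars=16):
    print(repr(piece))
```

A character window can cut a word in half, which is a concrete instance of the context loss listed under Cons.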

### Sentence Based Chunking

Sentence-based chunking ensures that each chunk represents a **complete thought** by splitting text at **sentence boundaries**.

**Mechanism**

•	Splits text at **sentence boundaries** to maintain logical structure.

•	Each chunk contains a **full sentence** rather than an arbitrary length of text.

**Ideal Use Cases**

•	Works best for **short, direct responses** such as **customer queries or conversational AI** applications.

**Pros**

•	**Context Preservation:** Maintains the integrity of individual sentences.

•	**Ease of Implementation:** Utilizes NLP tools for accurate sentence detection.

**Cons**

•	**Limited Context:** Single sentences may lack sufficient context for complex queries.

•	**Variable Length:** Sentence lengths can vary, leading to inconsistent chunk sizes.

**Implementation Strategies**

•	Use **NLP libraries** for accurate sentence boundary detection.

•	**Combine multiple short sentences** into a single chunk to provide more context.

In [8]:
# Install spaCy and download the en_core_web_sm model
%pip install spacy
!python -m spacy download en_core_web_sm


In [9]:
import spacy  # Import the spaCy library for NLP tasks

# Load the small English model for spaCy (used for sentence segmentation)
nlp = spacy.load("en_core_web_sm")

def sentence_chunk(text):
    """
    Splits the input text into sentences using spaCy's sentence boundary detection.

    Args:
        text (str): The text to be chunked into sentences.

    Returns:
        List[str]: A list of sentences extracted from the text.
    """
    doc = nlp(text)  # Process the text with spaCy to create a Doc object
    return [sent.text for sent in doc.sents]  # Extract and return sentences as strings

# Applying Sentence-Based Chunking
sentence_chunks = sentence_chunk(sample_text)  # Split the sample_text into sentence chunks

for i, chunk in enumerate(sentence_chunks, start=1):
    print(f"chunk {i}")  # Print the chunk number (1-based index)
    print(chunk, '\n---\n')  # Print the chunk content followed by a separator

chunk 1

Introduction

Data Science is an interdisciplinary field that uses scientific methods, processes,
 algorithms, and systems to extract knowledge and insights from structured and 
 unstructured data. 
---

chunk 2
It draws from statistics, computer science, machine learning, 
 and various data analysis techniques to discover patterns, make predictions, and 
 derive actionable insights.

 
---

chunk 3
Data Science can be applied across many industries, including healthcare, finance,
 marketing, and education, where it helps organizations make data-driven decisions,
  optimize processes, and understand customer behaviors.

 
---

chunk 4
Overview of Big Data

Big data refers to large, diverse sets of information that grow at ever-increasing 
rates. 
---

chunk 5
It encompasses the volume of information, the velocity or speed at which it 
is created and collected, and the variety or scope of the data points being 
covered.

 
---

chunk 6
Data Science Methods

There are several im
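
The strategy of combining multiple short sentences into one chunk can be sketched as a small post-processing step. The helper `group_sentences` and its `sentences_per_chunk` parameter are hypothetical names. The helper accepts any list of sentence strings, so it could be fed the output of `sentence_chunk` above.

```python
def group_sentences(sentences, sentences_per_chunk=3):
    """Group consecutive sentences into chunks of sentences_per_chunk sentences."""
    if sentences_per_chunk <= 0:
        raise ValueError("sentences_per_chunk must be positive")
    # Normalize internal whitespace and drop empty entries.
    cleaned = [" ".join(s.split()) for s in sentences if s.strip()]
    return [
        " ".join(cleaned[i:i + sentences_per_chunk])
        for i in range(0, len(cleaned), sentences_per_chunk)
    ]

print(group_sentences(["One.", " Two!\n", "", "Three?"], sentences_per_chunk=2))
```

Grouping by a fixed sentence count keeps every sentence whole but still yields chunks of uneven length. Grouping up to a character or token budget is a common alternative.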

### Paragraph Based Chunking

Paragraph-based chunking divides documents into **paragraphs**, ensuring each chunk encapsulates a **complete idea or topic**.

**Mechanism**

•	Splits text at **paragraph boundaries**, preserving logical flow.

•	Each chunk typically represents a **self-contained idea** or section.

**Ideal Use Cases**

•	Suitable for **structured documents** such as **articles, reports, or essays**.

**Pros**

•	**Richer Context:** Provides more information than sentence-based chunks.

•	**Logical Division:** Aligns with the natural structure of the text.

**Cons**

•	**Inconsistent Sizes:** Paragraph lengths can vary widely.

•	**Token Limits:** Large paragraphs may exceed the model’s token constraints.

**Implementation Strategies**

•	Monitor chunk sizes to ensure they stay within acceptable **token limits**.

•	If necessary, **split large paragraphs** further while preserving context.

In [10]:
def paragraph_chunk(text):
    """Split text into paragraphs at blank lines (two consecutive newlines)."""
    paragraphs = text.split('\n\n')
    return paragraphs

# Applying Paragraph-Based Chunking
paragraph_chunks = paragraph_chunk(sample_text)
for chunk in paragraph_chunks:
    print(chunk, '\n---\n')


Introduction 
---

Data Science is an interdisciplinary field that uses scientific methods, processes,
 algorithms, and systems to extract knowledge and insights from structured and 
 unstructured data. It draws from statistics, computer science, machine learning, 
 and various data analysis techniques to discover patterns, make predictions, and 
 derive actionable insights. 
---

Data Science can be applied across many industries, including healthcare, finance,
 marketing, and education, where it helps organizations make data-driven decisions,
  optimize processes, and understand customer behaviors. 
---

Overview of Big Data 
---

Big data refers to large, diverse sets of information that grow at ever-increasing 
rates. It encompasses the volume of information, the velocity or speed at which it 
is created and collected, and the variety or scope of the data points being 
covered. 
---

Data Science Methods 
---

There are several important methods used in Data Science: 
---

1. Regr
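
The two implementation strategies (monitoring chunk size and splitting large paragraphs further) can be combined in one function. This is a sketch under stated assumptions: `paragraph_chunk_capped` and `max_words` are hypothetical names, and a word count is used as a rough proxy for the model's token limit.

```python
def paragraph_chunk_capped(text, max_words=120):
    """Split on blank lines, then break paragraphs longer than max_words into word windows."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    chunks = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue  # skip empty pieces produced by leading or repeated blank lines
        # A paragraph that fits yields one chunk; a longer one yields several.
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

print(paragraph_chunk_capped("\nTitle\n\na b c d e\n\n\n\nx y\n", max_words=2))
```

Unlike the plain `split('\n\n')` version, this variant also drops empty pieces and normalizes whitespace inside each paragraph.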

### Sliding Window Chunking

Sliding window chunking creates **overlapping chunks** by shifting a window over the text, ensuring adjacent chunks share content for better context retention.

**Mechanism**

•	Uses a **moving window** to generate chunks with overlapping content.

•	Helps maintain continuity by ensuring **important details appear in multiple chunks**.

**Ideal Use Cases**

•	Best suited for **documents where maintaining context across sections is critical**, such as **legal or medical texts**.

**Pros**

•	**Context Continuity:** Overlaps help preserve the flow of information.

•	**Improved Retrieval:** Increases the chances that relevant information is included in the retrieved chunks.

**Cons**

•	**Redundancy:** Overlapping content can lead to duplicate information.

•	**Computational Cost:** More chunks require additional processing and storage.

**Implementation Strategies**

•	Optimize **window size and overlap** based on the document’s nature.

•	Use **deduplication techniques** during retrieval to handle redundancy.

In [12]:
def sliding_window_chunk(text, chunk_size=100, overlap=20):
    """
    Splits the input text into overlapping chunks using a sliding window approach.

    Args:
        text (str): The input text to be chunked.
        chunk_size (int): The number of tokens in each chunk.
        overlap (int): The number of tokens that overlap between consecutive chunks.

    Returns:
        list: A list of overlapping text chunks.
    """
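    # Guard against a zero or negative step, which would fail or silently return no chunks
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = text.split()  # Split the text into tokens (whitespace-separated words)
    chunks = []  # List to store the resulting chunks
    # Iterate over the tokens with a step size of (chunk_size - overlap)
    for i in range(0, len(tokens), chunk_size - overlap):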
        # Create a chunk by joining the tokens from i to i + chunk_size
        chunk = ' '.join(tokens[i:i + chunk_size])
        chunks.append(chunk)  # Add the chunk to the list
    return chunks

# Applying Sliding Window Chunking
sliding_chunks = sliding_window_chunk(sample_text)
# Print each chunk followed by a separator for clarity
for chunk in sliding_chunks:
    print(chunk, '\n---\n')

Introduction Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It draws from statistics, computer science, machine learning, and various data analysis techniques to discover patterns, make predictions, and derive actionable insights. Data Science can be applied across many industries, including healthcare, finance, marketing, and education, where it helps organizations make data-driven decisions, optimize processes, and understand customer behaviors. Overview of Big Data Big data refers to large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity 
---

refers to large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity or speed at which it is created and collected, and the variety or scope of the data points being covered. 

In the chunks below, the 20-word overlap shared by consecutive chunks is marked with backticks:

Introduction Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It draws from statistics, computer science, machine learning, and various data analysis techniques to discover patterns, make predictions, and derive actionable insights. Data Science can be applied across many industries, including healthcare, finance, marketing, and education, where it helps organizations make data-driven decisions, optimize processes, and understand customer behaviors. Overview of Big Data Big data `refers to large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity`

`refers to large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity` or speed at which it is created and collected, and the variety or scope of the data points being covered. Data Science Methods There are several important methods used in Data Science: 1. Regression Analysis 2. Classification 3. Clustering 4. Neural Networks Challenges in Data Science - Data Quality: Poor data quality can lead to incorrect conclusions. - Data Privacy: `Ensuring the privacy of sensitive information. - Scalability: Handling massive datasets efficiently. Conclusion Data Science continues to be a driving` 

`Ensuring the privacy of sensitive information. - Scalability: Handling massive datasets efficiently. Conclusion Data Science continues to be a driving` force in many industries, offering insights that can lead to better decisions and optimized outcomes. It remains an evolving field that incorporates the latest technological advancements. 
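
One simple way to handle the redundancy at retrieval time is to stitch adjacent chunks back together and drop the words repeated at each seam. The sketch below assumes the retrieved chunks are adjacent and already in document order. The name `merge_overlapping` is hypothetical, and the word-level suffix/prefix matching is only one possible deduplication technique.

```python
def merge_overlapping(chunks):
    """Stitch consecutive chunks together, dropping words repeated at the seams."""
    merged = []
    for chunk in chunks:
        words = chunk.split()
        # Find the longest suffix of the merged text that equals a prefix of this chunk.
        overlap = 0
        for k in range(min(len(merged), len(words)), 0, -1):
            if merged[-k:] == words[:k]:
                overlap = k
                break
        # Append only the words that are not already present at the seam.
        merged.extend(words[overlap:])
    return " ".join(merged)

print(merge_overlapping(["a b c d", "c d e f", "e f g"]))
```

Chunks that do not overlap are simply concatenated, so the function is safe to call on non-adjacent results, although the joined text will then skip over the missing material.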



### Semantic Based Chunking

Semantic-based chunking leverages **embeddings or machine learning models** to split text based on **semantic meaning**, ensuring each chunk remains **cohesive in topic or idea**.

**Mechanism**

•	Uses **NLP models** to detect and segment text based on **meaning rather than fixed sizes**.

•	Ensures that **each chunk represents a complete thought or concept**.

**Ideal Use Cases**

•	Best for **complex queries requiring deep understanding**, such as **technical manuals or academic papers**.

**Pros**

•	**Contextual Relevance:** Chunks are meaningfully grouped, improving retrieval accuracy.

•	**Flexibility:** Adapts to the text’s inherent structure and content.

**Cons**

•	**Complexity:** Requires advanced NLP models and significant computational resources.

•	**Processing Time:** Semantic analysis can be time-consuming.

**Implementation Strategies**

•	Use **pre-trained models** for **semantic segmentation**.

•	Balance between **computational cost and granularity** to optimize efficiency.

In [14]:
def semantic_chunk(text, max_len=200):
    """
    Approximates semantic chunking by grouping consecutive sentences.

    No embeddings are involved: sentences are accumulated until the chunk
    exceeds max_len characters, so max_len is a soft limit that the last
    sentence of a chunk can overshoot.
    """
    # Process the input text using the NLP model to obtain a Doc object
    doc = nlp(text)
    chunks = []         # List to store the resulting text chunks
    current_chunk = []  # Temporary list to accumulate sentences for the current chunk

    # Iterate over each sentence detected by the NLP model
    for sent in doc.sents:
        current_chunk.append(sent.text)  # Add the sentence text to the current chunk

        # If the combined length of the current chunk exceeds the maximum length
        if len(' '.join(current_chunk)) > max_len:
            # Join the sentences to form a chunk and add to the list of chunks
            chunks.append(' '.join(current_chunk))
            current_chunk = []  # Reset the current chunk for the next set of sentences

    # After the loop, if there are any remaining sentences, add them as the last chunk
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks  # Return the list of sentence-grouped chunks

# Applying Semantic-Based Chunking
# Use the semantic_chunk function to group the sentences of sample_text into chunks
semantic_chunks = semantic_chunk(sample_text)

# Print each chunk, separated by a visual divider for clarity
for chunk in semantic_chunks:
    print(chunk, '\n---\n')



Introduction

Data Science is an interdisciplinary field that uses scientific methods, processes,
 algorithms, and systems to extract knowledge and insights from structured and 
 unstructured data. It draws from statistics, computer science, machine learning, 
 and various data analysis techniques to discover patterns, make predictions, and 
 derive actionable insights.

 
---

Data Science can be applied across many industries, including healthcare, finance,
 marketing, and education, where it helps organizations make data-driven decisions,
  optimize processes, and understand customer behaviors.

 
---

Overview of Big Data

Big data refers to large, diverse sets of information that grow at ever-increasing 
rates. It encompasses the volume of information, the velocity or speed at which it 
is created and collected, and the variety or scope of the data points being 
covered.

 
---

Data Science Methods

There are several important methods used in Data Science:

1. Regression Analysi
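
The function above groups sentences by length only. To show the embedding-driven idea described in this section, here is a minimal sketch that starts a new chunk whenever two adjacent sentences are not similar enough. Everything in it is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a real embedding model, and `similarity_chunk` and `threshold` are hypothetical names.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_chunk(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk whenever adjacent sentences are less similar than threshold."""
    chunks, current, prev_vec = [], [], None
    for sentence in sentences:
        vec = embed(sentence)
        # A drop in similarity is treated as a topic boundary.
        if current and cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = ["Cats chase mice.", "Mice fear cats.",
             "Stocks fell sharply.", "Investors sold stocks."]
print(similarity_chunk(sentences))
```

With a real embedding model, `toy_embed` and `cosine` would be replaced together by the model's encoder and a dense-vector similarity, and the threshold would need tuning on representative documents. The list of sentences could come from `sentence_chunk` defined earlier.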

### RecursiveCharacterTextSplitter

The `RecursiveCharacterTextSplitter` is a versatile tool within LangChain for splitting text based on a list of characters. This splitter is designed to handle various requirements through adjustable parameters.

This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

#### Features and Parameters:

- **Character List:** Utilizes a specified list of characters to determine where splits should occur.
- **Chunk Size:** Allows you to set the size of each chunk, helping ensure that chunks are manageable and suit the context window of your model.
- **Overlap:** Configurable overlap between consecutive chunks to maintain context continuity across chunks.

This splitter is particularly useful for texts where precise control over the splitting criteria is needed, allowing for customized chunking strategies based on specific characters.
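
To make the "try separators in order" idea concrete, here is a simplified pure-Python sketch. It is an illustration of the general technique only, not LangChain's actual implementation: the function `recursive_split` is hypothetical, it ignores overlap, and it measures size with `len`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # Separator absent: fall through to the next, finer separator.
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

print(recursive_split("aaa bbb\n\nccc ddd eee\n\nfff", chunk_size=8))
```

Paragraphs that fit are kept whole, and only the oversized middle paragraph is broken at word boundaries. That is the behaviour the default separator order is meant to produce.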


In [21]:
from pprint import pprint

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

doc = """Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colors and joyous energy.
"""

In [22]:
pprint(doc)

('Welcome to Green Valley, a small town nestled in the heart of the mountains. '
 'With its picturesque landscapes and vibrant community life, Green Valley has '
 'been a hidden gem for years. The main street is lined with an array of shops '
 'and cafes, each offering a unique taste of local flavor and culture.\n'
 'On a typical afternoon, the town square comes alive with the bustling sounds '
 'of locals and visitors mingling. Children play near the fountain, artists '
 'display their crafts, and an old man tells stories of days gone by. The '
 'aroma of freshly baked bread wafts from the bakery, drawing a steady stream '
 'of customers.\n'
 'Green Valley is not only known for its scenic beauty but also for its annual '
 'festivals. The most anticipated event is the Harvest Festival, celebrated '
 'with great enthusiasm. Locals prepare months in advance, cultivating crops '
 'and crafting goods for the occasion. The festival features a parade, various '
 'competitions, and a night ma

In [17]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # Try paragraphs first, then lines, then words, then single characters
    chunk_size=300,
    chunk_overlap=0,
)

In [18]:
text_splitter.create_documents([doc])

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.'),
 Document(metadata={}, page_content='On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream'),
 Document(metadata={}, page_content='of customers.'),
 Document(metadata={}, page_content='Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and 

In [23]:
texts = text_splitter.split_text(doc)
print(len(texts)) # 5

5


In [24]:
texts

['Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.', 'On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream', 'of customers.', 'Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade,', 'various competitions, and a night market that lights up the town with vibrant 

In [25]:
for text in texts:
    print(text)
    print(len(text))
    print()

Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
299

On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream
298

of customers.
13

Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade,
294

various competitions, and a night market that lights up the town with vib

A larger chunk size (measured in characters) produces fewer chunks.

In [26]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=0,
)

texts = text_splitter.split_text(doc)
print(len(texts)) # 3

3


In [27]:
for text in texts:
    print(text)
    print(len(text))
    print()

Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
299

On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
312

Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colo

`chunk_overlap` mitigates the loss of information when context is split across chunks, which matters most when the chunks are small.

In [28]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=300,
    chunk_overlap=100,
)

texts = text_splitter.split_text(doc)
print(len(texts)) # 5

5


In [29]:
for text in texts:
    print(text)
    print(len(text))
    print()

Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
299

On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream
298

of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
110

Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival

You can create LangChain `Document` objects from the chunks with the `create_documents` method.

In [30]:
text_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", " ", ""],
    chunk_size=500,
    chunk_overlap=100,
)

In [31]:
docs = text_splitter.create_documents([doc])
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.'), Document(metadata={}, page_content='On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.'), Document(metadata={}, page_content='Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festi

### CharacterTextSplitter

The `CharacterTextSplitter` is a straightforward tool in LangChain for dividing text based on a specified character. It's designed to be simple yet effective, providing essential controls for customizing how text is segmented.

#### Key Features and Parameters:
- **Split Character:** By default, it splits on a blank line (`"\n\n"`), but the `separator` parameter can be set to any string you specify.
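- **Chunk Size:** Allows you to define the target length of each chunk in terms of the number of characters. This is useful for ensuring each piece of text is of a manageable size for processing. Because the text is only split at the separator, a single piece longer than the chunk size is kept whole.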
- **Overlap:** You can set the amount of overlap between consecutive chunks. This helps maintain context and continuity when text is split into separate parts.

This method is the simplest among text splitting tools, focusing on character-based division and providing straightforward measures for chunk length and overlap.

To obtain the string content directly, use `.split_text`.

To create LangChain `Document` objects (e.g., for use in downstream tasks), use `.create_documents`.


In [32]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=200,
    is_separator_regex=False,
)

docs = text_splitter.create_documents([doc])
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.'), Document(metadata={}, page_content='On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.'), Document(metadata={}, page_content='Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festi

In [35]:
pprint(docs)

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.'),
 Document(metadata={}, page_content='On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.'),
 Document(metadata={}, page_content='Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The fes

In [40]:
text = text_splitter.split_text(doc)
print(len(text))


3


In [41]:
for t in text:
    print(t)
    print(len(t))
    print()

Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
299

On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
312

Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colo

### Code Splitters

`RecursiveCharacterTextSplitter` includes pre-built lists of separators that are useful for splitting text in a specific programming language.
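
To make the idea of a language-specific separator list concrete, here is a minimal pure-Python sketch. It is not LangChain's code: `recursive_split`, `merge_small`, and the abbreviated `PYTHON_SEPARATORS` list are illustrative assumptions. The text is split on the coarsest separator first, oversized parts are split again with finer separators, and small neighbours are then glued back together.

```python
# Abbreviated, illustrative separator list (an assumption), coarsest first.
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\n", "\n", " "]


def recursive_split(text, separators, chunk_size):
    """Split on the coarsest separator, recursing with finer ones if needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Re-attach the separator so keywords such as "def" stay with their block.
    parts = [parts[0]] + [sep + part for part in parts[1:]]
    pieces = []
    for part in parts:
        pieces.extend(recursive_split(part, finer, chunk_size))
    return pieces


def merge_small(pieces, chunk_size):
    """Greedily glue adjacent pieces back together while they still fit."""
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current.strip())
            current = ""
        current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


python_code = '''
def hello_world():
    print("Hello, World!")
hello_world()
'''

pieces = recursive_split(python_code, PYTHON_SEPARATORS, chunk_size=50)
for chunk in merge_small(pieces, chunk_size=50):
    print(repr(chunk))
```

Ordering the separators from coarse (`class`, `def`) to fine (newline, space) means a function body is only broken apart when it cannot fit in a chunk as a whole.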

In [42]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

python_code = """
def hello_world():
    print("Hello, World!")
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([python_code])
python_docs

[Document(metadata={}, page_content='def hello_world():\n    print("Hello, World!")'), Document(metadata={}, page_content='hello_world()')]

### Markdown Splitters

We might want to chunk a document based on its structure. For example, a markdown file is organized by headers, so creating chunks within specific header groups is an intuitive approach. `MarkdownHeaderTextSplitter` does this by splitting a markdown file on a specified set of headers.

For example, if we want to split this markdown:

```
markdown_document = """
# Team Introductions

## Management Team

Hi, this is Jim, the CEO.  
Hi, this is Joe, the CFO.

## Development Team

Hi, this is Molly, the Lead Developer.
"""
```

We can specify the headers to split on:

```
[("#", "Header 1"),
 ("##", "Header 2")]
```

And content is grouped or split by common headers:

```
Document(page_content='Hi, this is Jim, the CEO.\nHi, this is Joe, the CFO.',
metadata={'Header 1': 'Team Introductions', 'Header 2': 'Management Team'})

Document(page_content='Hi, this is Molly, the Lead Developer.',
metadata={'Header 1': 'Team Introductions', 'Header 2': 'Development Team'})
```
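
The grouping shown above can be approximated in a few lines of plain Python. This is only a sketch of the idea, not the `MarkdownHeaderTextSplitter` implementation: `split_by_headers` is a hypothetical helper, and it returns plain dictionaries instead of `Document` objects.

```python
def split_by_headers(markdown, headers_to_split_on):
    """Group body lines under the headers that are active when they appear."""
    # Test longer prefixes first so "##" is matched before "#".
    ordered = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}            # header label -> current header title
    chunks, buffer = [], []

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({"page_content": body, "metadata": dict(active)})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for prefix, label in ordered:
            if stripped.startswith(prefix + " "):
                flush()
                # A new header closes every header at the same or a deeper level.
                for other_prefix, other_label in headers_to_split_on:
                    if len(other_prefix) >= len(prefix):
                        active.pop(other_label, None)
                active[label] = stripped[len(prefix):].strip()
                break
        else:
            buffer.append(stripped)
    flush()
    return chunks


markdown_document = """
# Team Introductions

## Management Team

Hi, this is Jim, the CEO.
Hi, this is Joe, the CFO.

## Development Team

Hi, this is Molly, the Lead Developer.
"""

sections = split_by_headers(markdown_document, [("#", "Header 1"), ("##", "Header 2")])
for section in sections:
    print(section)
```

Keeping the active headers in the metadata lets a retriever filter or label chunks by section without the header text having to appear in the chunk body.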

In [43]:
markdown_document = """
# Team Introductions

## Management Team
Hi, this is Jim, the CEO.
Hi, this is Joe, the CFO.

## Development Team
Hi, this is Molly, the Lead Developer.

### Intern Team
Hi, This is Subhash. The new Intern
"""

In [44]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

[Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Management Team'}, page_content='Hi, this is Jim, the CEO.\nHi, this is Joe, the CFO.'), Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Development Team'}, page_content='Hi, this is Molly, the Lead Developer.'), Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Development Team', 'Header 3': 'Intern Team'}, page_content='Hi, This is Subhash. The new Intern')]

By default, `MarkdownHeaderTextSplitter` strips headers being split on from the output chunk's content. This can be disabled by setting `strip_headers = False`.

In [45]:
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on, strip_headers=False)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits

[Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Management Team'}, page_content='# Team Introductions  \n## Management Team\nHi, this is Jim, the CEO.\nHi, this is Joe, the CFO.'), Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Development Team'}, page_content='## Development Team\nHi, this is Molly, the Lead Developer.'), Document(metadata={'Header 1': 'Team Introductions', 'Header 2': 'Development Team', 'Header 3': 'Intern Team'}, page_content='### Intern Team\nHi, This is Subhash. The new Intern')]

### Tokenizer-based Splitting

Language models have a token limit that must not be exceeded, so it is a good idea to count tokens when splitting text into chunks. Many tokenizers exist; when counting tokens, use the same tokenizer as the target language model. Let's look at how we can chunk documents using different tokenizers.



#### tiktoken splitters

[`tiktoken`](https://github.com/openai/tiktoken) is a fast BPE tokenizer created by OpenAI.

We can use `tiktoken` to estimate the number of tokens used. The estimate will generally be more accurate for OpenAI models. Here `chunk_size` is measured in tokens, not characters.

For OpenAI models, 1 token is roughly 3/4 of a word.

Approx: 100 tokens ~= 75 words.



We can load a [`TokenTextSplitter`](https://api.python.langchain.com/en/latest/base/langchain_text_splitters.base.TokenTextSplitter.html), which works with `tiktoken` directly and ensures that each split contains at most `chunk_size` tokens.
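
Token-window splitting with overlap amounts to sliding a fixed-size window over the token sequence with a stride of `chunk_size - chunk_overlap`. The sketch below illustrates that arithmetic under stated assumptions: `split_on_tokens` is a hypothetical helper, and a list of 195 integers (an arbitrary length chosen for illustration) stands in for real token IDs produced by a tokenizer.

```python
def split_on_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of chunk_size tokens, advancing chunk_size - chunk_overlap each step."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows, start = [], 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        windows.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window reached the end of the input
        start += step
    return windows


# Integers stand in for token IDs; a real tokenizer would produce them from text.
tokens = list(range(195))

small = split_on_tokens(tokens, chunk_size=30, chunk_overlap=10)
large = split_on_tokens(tokens, chunk_size=100, chunk_overlap=30)
print(len(small), len(large))
```

Because the window ignores sentence and word boundaries, chunks can start or end mid-sentence, which is the trade-off for a strict token budget.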

In [46]:
doc = """Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various competitions, and a night market that lights up the town with vibrant colors and joyous energy.
"""

In [47]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(model_name='gpt-4o-mini',
                                  chunk_size=30,
                                  chunk_overlap=10)

docs = text_splitter.create_documents([doc])

In [48]:
len(docs)

10

In [49]:
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a'), Document(metadata={}, page_content=' and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering'), Document(metadata={}, page_content=' with an array of shops and cafes, each offering a unique taste of local flavor and culture.\nOn a typical afternoon, the town square comes alive with'), Document(metadata={}, page_content=' a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts'), Document(metadata={}, page_content=' Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts'), Document(metadata={}, pag

In [50]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Tokens:', len(enc.encode(d.page_content)),
        'Chunk:', d.page_content)

Words: 27 Tokens: 30 Chunk: Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a
Words: 28 Tokens: 30 Chunk:  and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering
Words: 27 Tokens: 30 Chunk:  with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with
Words: 27 Tokens: 30 Chunk:  a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts
Words: 27 Tokens: 30 Chunk:  Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts
Words: 26 Tokens: 30 Chunk:  by. The aroma of freshly baked bread wafts from the b

In [51]:
text_splitter = TokenTextSplitter(model_name='gpt-4o-mini',
                                  chunk_size=100,
                                  chunk_overlap=30)

docs = text_splitter.create_documents([doc])

In [52]:
len(docs)

3

In [53]:
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.\nOn a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone'), Document(metadata={}, page_content=' the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.\nGreen Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated

In [54]:
enc = tiktoken.encoding_for_model("gpt-4o-mini")
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Tokens:', len(enc.encode(d.page_content)),
        'Chunk:', d.page_content)

Words: 88 Tokens: 100 Chunk: Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone
Words: 86 Tokens: 100 Chunk:  the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusia

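To implement a hard constraint on the chunk size while still breaking at natural boundaries, we can use `RecursiveCharacterTextSplitter.from_tiktoken_encoder`. It splits on the usual recursive separators but measures length in tokens, and any split that is still larger than `chunk_size` is split again recursively. The resulting chunks tend to end at paragraph or sentence boundaries, which makes them more meaningful.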
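
The key change from character-based splitting is the length function: pieces are still whole paragraphs, but the budget is counted in tokens. The sketch below shows that idea only. `pack_by_length` is a hypothetical helper, a whitespace word count stands in for a real tokenizer (an assumption), and overlap is left out for brevity.

```python
def pack_by_length(pieces, max_len, length_function, joiner="\n"):
    """Pack whole pieces into chunks whose measured length stays within max_len."""
    chunks, current = [], []
    for piece in pieces:
        candidate = joiner.join(current + [piece])
        if current and length_function(candidate) > max_len:
            chunks.append(joiner.join(current))
            current = []
        current.append(piece)
    if current:
        chunks.append(joiner.join(current))
    return chunks


def count_tokens(text):
    # Stand-in tokenizer: one "token" per whitespace-separated word.
    return len(text.split())


# Three synthetic paragraphs of 59, 62 and 72 stand-in tokens.
paragraphs = [" ".join(["w"] * n) for n in (59, 62, 72)]

tight = pack_by_length(paragraphs, max_len=100, length_function=count_tokens)
loose = pack_by_length(paragraphs, max_len=130, length_function=count_tokens)
print([count_tokens(c) for c in tight], [count_tokens(c) for c in loose])
```

With a budget of 100 no two paragraphs fit together, so each becomes its own chunk; raising the budget to 130 lets the first two merge.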

In [55]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4o-mini",
    chunk_size=100,
    chunk_overlap=30,
)

docs = text_splitter.create_documents([doc])

In [56]:
len(docs)

3

In [57]:
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.'), Document(metadata={}, page_content='On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.'), Document(metadata={}, page_content='Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festi

In [58]:
enc = tiktoken.encoding_for_model("gpt-4o-mini")
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Tokens:', len(enc.encode(d.page_content)),
        'Chunk:', d.page_content)

Words: 53 Tokens: 59 Chunk: Welcome to Green Valley, a small town nestled in the heart of the mountains. With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years. The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.
Words: 53 Tokens: 62 Chunk: On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.
Words: 63 Tokens: 72 Chunk: Green Valley is not only known for its scenic beauty but also for its annual festivals. The most anticipated event is the Harvest Festival, celebrated with great enthusiasm. Locals prepare months in advance, cultivating crops and crafting goods for the occasion. The festival features a parade, various 

#### spaCy

[spaCy](https://spacy.io/) is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

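LangChain implements splitters based on the [spaCy tokenizer](https://spacy.io/api/tokenizer). `SpacyTextSplitter` uses spaCy to find sentence boundaries, while `chunk_size` is still measured in characters.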

In [59]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [60]:
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=500,
                                  chunk_overlap=50)

docs = text_splitter.create_documents([doc])



In [61]:
len(docs)

3

In [62]:
docs

[Document(metadata={}, page_content='Welcome to Green Valley, a small town nestled in the heart of the mountains.\n\nWith its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years.\n\nThe main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.\n\n\nOn a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling.'), Document(metadata={}, page_content='Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by.\n\nThe aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.\n\n\nGreen Valley is not only known for its scenic beauty but also for its annual festivals.\n\nThe most anticipated event is the Harvest Festival, celebrated with great enthusiasm.\n\nLocals prepare months in advance, cultivating crops and crafting goods for the occasion.'), Document(metadata={}

In [63]:
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Characters:', len(d.page_content),
        'Chunk:', d.page_content)

Words: 68 Characters: 413 Chunk: Welcome to Green Valley, a small town nestled in the heart of the mountains.

With its picturesque landscapes and vibrant community life, Green Valley has been a hidden gem for years.

The main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture.


On a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling.
Words: 72 Characters: 470 Chunk: Children play near the fountain, artists display their crafts, and an old man tells stories of days gone by.

The aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers.


Green Valley is not only known for its scenic beauty but also for its annual festivals.

The most anticipated event is the Harvest Festival, celebrated with great enthusiasm.

Locals prepare months in advance, cultivating crops and crafting goods for the occasion.
Words: 22 Characters: 135 Chunk: The festival fea

#### SentenceTransformers

The [`SentenceTransformersTokenTextSplitter`](https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html) is a specialized text splitter for use with `sentence-transformers` models.

By default, it splits the text into chunks that fit the token window of the sentence-transformers model you want to use. Chunks are rebuilt by decoding token IDs, so the text can come back normalized by the tokenizer; in the run below it is lowercased and newlines become spaces.

In [66]:
!pip install -qq sentence_transformers

In [67]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(model_name="sentence-transformers/all-mpnet-base-v2",
                                                 tokens_per_chunk=100,
                                                 chunk_overlap=30)


In [68]:
docs = splitter.create_documents([doc])

In [69]:
len(docs)

3

In [70]:
docs


[Document(metadata={}, page_content='welcome to green valley, a small town nestled in the heart of the mountains. with its picturesque landscapes and vibrant community life, green valley has been a hidden gem for years. the main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture. on a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. children play near the fountain, artists display their crafts, and an old man tells stories of days'), Document(metadata={}, page_content='the bustling sounds of locals and visitors mingling. children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. the aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers. green valley is not only known for its scenic beauty but also for its annual festivals. the most anticipated event is the harvest festival, celebrated with gr

In [71]:
for d in docs:
  print('Words:', len(d.page_content.split(' ')),
        'Characters:', len(d.page_content),
        'Chunk:', d.page_content)

Words: 88 Characters: 509 Chunk: welcome to green valley, a small town nestled in the heart of the mountains. with its picturesque landscapes and vibrant community life, green valley has been a hidden gem for years. the main street is lined with an array of shops and cafes, each offering a unique taste of local flavor and culture. on a typical afternoon, the town square comes alive with the bustling sounds of locals and visitors mingling. children play near the fountain, artists display their crafts, and an old man tells stories of days
Words: 84 Characters: 517 Chunk: the bustling sounds of locals and visitors mingling. children play near the fountain, artists display their crafts, and an old man tells stories of days gone by. the aroma of freshly baked bread wafts from the bakery, drawing a steady stream of customers. green valley is not only known for its scenic beauty but also for its annual festivals. the most anticipated event is the harvest festival, celebrated with great enthus

### Section-based Splitting in Unstructured.io

Chunking functions in `unstructured` use metadata and document elements detected with partition functions to split a document into smaller parts for use cases such as Retrieval Augmented Generation (RAG).

Because `unstructured` uses specific knowledge about each document format to partition the document into semantic units (document elements), text-splitting is needed only when a single element exceeds the desired maximum chunk size. Except in that case, all chunks contain one or more whole elements, preserving the coherence of semantic units established during partitioning.

- Chunking is performed on document elements. It is a separate step performed after partitioning, on the elements produced by partitioning. (Although it can be combined with partitioning in a single step.)

- Chunking combines consecutive elements to form chunks as large as possible without exceeding the maximum chunk size.

- A single element that by itself exceeds the maximum chunk size is divided into two or more chunks using text-splitting.

- Chunking produces a sequence of `CompositeElement`, `Table`, or `TableChunk` elements. Each “chunk” is an instance of one of these three types.

Chunking Options:

The following options are available to tune chunking behaviors. These are keyword arguments that can be used in a partitioning or chunking function call. All these options have defaults and need only be specified when a non-default setting is required. Specific chunking strategies (such as “by-title”) may have additional options.

- `max_characters`: (default=500) - the hard maximum size for a chunk. No chunk will exceed this number of characters. A single element that by itself exceeds this size will be divided into two or more chunks using text-splitting.

- `new_after_n_chars`: (default=max_characters) - the “soft” maximum size for a chunk. A chunk that already exceeds this number of characters will not be extended, even if the next _element_ would fit without exceeding the specified hard maximum. This can be used in conjunction with `max_characters` to set a “preferred” size, like “I prefer chunks of around 1000 characters, but I’d rather have a chunk of 1500 (max_characters) than resort to text-splitting”. This would be specified with `(..., max_characters=1500, new_after_n_chars=1000)`.

- `overlap`: (default=0) - only when using text-splitting to break up an oversized chunk, include this number of characters from the end of the prior chunk as a prefix on the next. This can mitigate the effect of splitting the semantic unit represented by the oversized element at an arbitrary position based on text length.

- `combine_text_under_n_chars`: (default=`max_characters`) - an option of the `by_title` strategy. Sequential small section chunks are combined to fill the chunking window as fully as possible, producing a logically larger chunk.


There are currently two chunking strategies, `basic` and `by_title`.

The `basic` strategy combines sequential elements to maximally fill each chunk while respecting both the specified `max_characters` (hard-max) and `new_after_n_chars` (soft-max) option values.

The `by_title` chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections.
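
The interaction between section boundaries, the hard maximum, and the soft maximum can be sketched as follows. This is a simplified illustration, not `unstructured`'s implementation: `chunk_by_title` here is a hypothetical function over `(category, text)` tuples, joining elements with a blank line is an assumption, and `overlap` and `combine_text_under_n_chars` are not modelled.

```python
def chunk_by_title(elements, max_characters=500, new_after_n_chars=None):
    """elements: list of (category, text) tuples in reading order."""
    soft_max = max_characters if new_after_n_chars is None else new_after_n_chars
    chunks, current = [], []

    def joined(parts):
        return "\n\n".join(parts)

    def flush():
        if current:
            chunks.append(joined(current))
            current.clear()

    for category, text in elements:
        if category == "Title":
            flush()  # section boundary: never mix two sections in one chunk
        elif current and (
            len(joined(current)) > soft_max                      # soft max already exceeded
            or len(joined(current + [text])) > max_characters    # hard max would be exceeded
        ):
            flush()
        if len(text) > max_characters:
            # Oversized single element: fall back to plain text-splitting.
            flush()
            chunks.extend(text[i:i + max_characters]
                          for i in range(0, len(text), max_characters))
            continue
        current.append(text)
    flush()
    return chunks


elements = [
    ("Title", "Intro"),
    ("NarrativeText", "a" * 40),
    ("Title", "Methods"),
    ("NarrativeText", "b" * 30),
    ("NarrativeText", "c" * 30),
]
for chunk in chunk_by_title(elements, max_characters=60):
    print(len(chunk), repr(chunk[:12]))
```

Setting `new_after_n_chars` below `max_characters` makes the sketch close a chunk early once it has grown past the preferred size, even when the next element would still fit under the hard limit.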

In [72]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Error loading punkt_tab: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1006)>


False

In [73]:
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Error loading averaged_perceptron_tagger_eng: <urlopen
[nltk_data]     error [SSL: CERTIFICATE_VERIFY_FAILED] certificate
[nltk_data]     verify failed: unable to get local issuer certificate
[nltk_data]     (_ssl.c:1006)>


False

In [75]:
from langchain_community.document_loaders import UnstructuredPDFLoader

# takes 3-4 mins on Colab
loader = UnstructuredPDFLoader('../../docs/layoutparser_paper.pdf',
                               strategy='hi_res',
                               extract_images_in_pdf=False,
                               infer_table_structure=True,
                               chunking_strategy="by_title",
                               max_characters=4000,
                               new_after_n_chars=3800,
                               combine_text_under_n_chars=2000,
                               mode='elements')
data = loader.load()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [76]:
len(data)

16

In [77]:
[doc.metadata['category'] for doc in data]

['CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement', 'CompositeElement']

In [78]:
data[0]

Document(metadata={'source': '../../docs/layoutparser_paper.pdf', 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2025-05-30T10:16:46', 'page_number': 1, 'orig_elements': 'eJzNWllv3EYS/iu9ekoW0wz74qGXtWMDG2ed3cBxNkEcw+ijONM2hxzwkCwb/u9b3SRljUfxYgyMMQ+CVMWuPr766mhSL95fQA1baIZX3l1ckouqdCJVzFGTCk0ly3Na2kJQkRrLsxKy1BUXK3KxhUE7PWi0eX9h27ZzvtED9FGu9U07Dq824NebATWcpynazOpr74YNalketbvWN0Owe/FCqkSuiFIykS9XZBZZKnjCgszSNCnvUUwGqLjob/oBtuEcP/u3UP+y0xYuPuADBwPYwbfNK1vrvn+161qDw9JEZTJjOKDyNQw3O4i2P/90EbfbrEe9jmd6cQHN+uJl1PbDq23rfOUhIsZTrmiqEKHnLL1k2aXMgvUOLV8149ZAF84aNjHA24DGBSOcpPjTkJH8SIL0kvyXPCIJ6YPlso0fQDs0RstPnZSnkimBywprJJU8Lai2TtKSg+Ast2XJshM7iaVcJPyul2SelHteOlBEi8+66Wt6wd7F+tfGIjDrtvPvwD0PI+6BXVooMAIMzbQEjA2ZUpMrTmUlS/wtDXNwctgXVBc5y6ZguUX5QBEtzgb2F0fDrrVyzglOmRDI9lQwqllRUCelzgqldG7kyWEv5R7bGWf75D5QTBZnAzs/GnarnOQcU78EQLYrA7RwpqTAXV4aXlnH9MlgV2mSRhBVkgZUZ1kpMclFca88jT8f0LFUJZipZXHFie5+91dHeyF1qdJlSDc6uKKwBS04x6JccXAFL/KyyE9N/ltuL7Isk3yP/AeKaHE2fjgedqHKwnKbUeM4tkHS5RSzk

In [79]:
print(data[0].page_content)

1 2 0 2 n u J 1 2 ] V C . s

c

[

2

2103.15348v2 arXiv

v

8

4

3

5

1

.

3

0

1

2

:

v

i

X

r

a

LayoutParser: A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis

Zejiang Shen! (4), Ruochen Zhang”, Melissa Dell?, Benjamin Charles Germain Lee*, Jacob Carlson’, and Weining Li®

1 Allen Institute for AI shannons@allenai.org 2 Brown University ruochen zhang@brown.edu 3 Harvard University {melissadell,jacob carlson}@fas.harvard.edu 4 University of Washington bcgl@cs.washington.edu 5 University of Waterloo w422li@uwaterloo.ca

Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy reuse of im- portant innovations by a wide audience. Though there have been on-going eﬀorts to impro

In [80]:
data[1]

Document(metadata={'source': '../../docs/layoutparser_paper.pdf', 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2025-05-30T10:16:46', 'page_number': 1, 'orig_elements': 'eJy9WtuO3LgR/RVinjzAUNCFuvnNiROsAScxsBMEWMcwKIpqKaOWZFGacXux/55TpLpH3dP2rmfRhh/GKpEUWXXq1IX9/tcr3eqt7qaPTXn1kl1pJaSUhc/DRKRc6DLghapi7udCRUJmkIqrG3a11ZMs5SQx59cr1fdj2XRy0sY+t3LXz9PHWjebeoIkDH0fcxbxQ1NONaRBaqVD33QTzXv/PkqFF93gReh7wYcb9iiIIy8iQRIlXnxWYKdAcmV2ZtJbOsu75rNufx6k0le/4UWpJ62mpu8+qlYa83EY+wLDfC/LkzTFgKpp9bQbtJ377h9XdsvdZpYbe673V7rbXH2wUjN93PZlUzXaai30Q6go5pF/G/gvg+SlSGj2gJkfu3lb6JHOS5uY9GfSyFXA3nTT2Jez3REN3n/5tplau+FT0/ixX1WVkDxOYB9R6IwXhZ/xqKx8v0pkVJTJBU0TeoL0nMVe7kzjBJlIvIwEQRRlXnZW4iY9zzh5mMThJYzTNt2dnfnrlZnkCC13pf5Mh4+TtaECGjyPLT2oZtJeLcdW70I/iPW9bGdpDfjbh2/b+7XWA3ur5dg13ebF67fXvJBGl0wOOKlUtTZMjppNtWbYzaR5X3E8cGyMVf3IJINtNBtxZM36ipW9mgkbrNnio0x2st2ZxrAXr9+8umaTNHeGNZ1qZxh/czraqrj571wVfqDs/mGs4GaNwn/KccSbe31LBziDxqwqSxEmmkdCgyhSi8Yw5LEIZV6oOAny+HJoTHIPHhvGqScsGJfnLPRiywpB4uVnnu345+EwS

In [81]:
print(data[1].page_content)

1 Introduction

Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classiﬁcation [11,

2 Z. Shen et al.

37], layout detection [38, 22], table detection [26], and scene text detection [4]. A generalized learning-based framework dramatically reduces the need for the manual speciﬁcation of complicated rules, which is the status quo with traditional methods. DL has the potential to transform DIA pipelines and beneﬁt a broad spectrum of large-scale document digitization projects.

However, there are several practical diﬃculties for taking advantages of re- cent advances in DL-based methods: 1) DL models are notoriously convoluted for reuse and extension. Existing models are developed using distinct frame- works like TensorFlow [1] or PyTorch [24], and the high-level parameters can be obfuscated by implementation details [8]. It can be a time-consuming and frustrating experience to debug, reproduce, an