# Data Chunking Lab: Text Splitting Strategies in LangChain

## Overview

In this tutorial, we'll explore different text chunking strategies used in Retrieval-Augmented Generation (RAG) systems and Large Language Model (LLM) applications. Chunking is the process of breaking down large documents into smaller, manageable pieces that can be processed, embedded, and retrieved effectively.

### Why is Chunking Important?

- **Token Limits**: LLMs have context window limitations
- **Retrieval Precision**: Smaller chunks allow for more precise semantic search
- **Cost Efficiency**: Sending only the relevant chunks to a model uses fewer tokens than sending whole documents
- **Context Preservation**: Good chunking maintains semantic meaning

### What You'll Learn

1. **Fixed-Size Chunking**: Simple character-based splitting
2. **Recursive Character Chunking**: Intelligent context-preserving splitting
3. **Sentence-Based Chunking**: Splitting by natural sentence boundaries
4. **Semantic Chunking**: Splitting by paragraph/logical boundaries

Let's get started!

## Step 1: Import Required Libraries

We'll use LangChain's document loaders and text splitters, along with NLTK for sentence tokenization.

In [1]:
# Import required libraries
import os
import nltk
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
from nltk.tokenize import sent_tokenize

## Step 2: Download Required NLTK Data

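NLTK requires certain data files for tokenization. We'll download the required 'punkt' tokenizer data. Recent NLTK releases load a resource named `punkt_tab` instead; if `sent_tokenize` raises a `LookupError` later on, run `nltk.download('punkt_tab')` as well.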

In [2]:
# Download required NLTK data for sentence tokenization
try:
    nltk.data.find('tokenizers/punkt')
    print("✅ NLTK punkt tokenizer already available")
except LookupError:
    print("📥 Downloading NLTK punkt tokenizer...")
    nltk.download('punkt')
    print("✅ NLTK punkt tokenizer downloaded successfully")

✅ NLTK punkt tokenizer already available


## Step 3: Load PDF Document

We'll create a function to load and extract text from a PDF file using LangChain's PyMuPDFLoader. This loader extracts the text and returns one `Document` per page, each carrying page metadata, so the "document chunks" counted below are pages.

In [3]:
# Define a function to load and extract text from PDF
def load_pdf_with_langchain(pdf_path):
    
    # Use LangChain's built-in loader
    loader = PyMuPDFLoader(pdf_path)
    
    # Load the PDF into LangChain's document format
    documents = loader.load()
    
    print(f"Successfully loaded {len(documents)} document chunks from the PDF.")
    return documents

## Step 4: Load Your Healthcare Document

Now let's load an actual PDF document. Replace the path below with your own healthcare-related PDF file.

In [5]:
# Path to the uploaded PDF (replace with your actual file path)
pdf_path = "./data/41598_2020_Article_64454.pdf"

# Extract the document chunks
docs = load_pdf_with_langchain(pdf_path)

Successfully loaded 13 document chunks from the PDF.


---

# Method 1: Fixed-Size Chunking

## Understanding Fixed-Size Chunking

Fixed-size chunking is the simplest approach where we break text into chunks based on a fixed number of characters. While simple and fast, it may cut off sentences halfway, potentially losing context.

### Key Parameters:

#### `chunk_size`
The maximum number of characters in each chunk.
- **Example**: `chunk_size=500` creates chunks of up to 500 characters
- Determines how much text is in each piece

#### `chunk_overlap`
The number of characters repeated between chunks to keep context.
- **Example**: `chunk_overlap=50` means up to the last 50 characters of one chunk appear at the start of the next
- Use overlap to avoid cutting off sentences or breaking flow across chunks
- Helps maintain continuity and context between adjacent chunks
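
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal pure-Python sketch of a sliding character window. The function name `sliding_window_chunks` is hypothetical, and this is an illustration of the idea only: LangChain's splitters first split on a separator and then merge pieces, so their output will not match this exactly.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

With `chunk_size=10` and `chunk_overlap=3`, each window starts 7 characters after the previous one, so the last 3 characters of one chunk are repeated at the start of the next.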

### Pros and Cons:

**Pros:**
- ✅ Simple and fast
- ✅ Predictable chunk sizes
- ✅ Good for well-structured data

**Cons:**
- ❌ May cut sentences halfway
- ❌ Can break logical flow
- ❌ May confuse language models with incomplete thoughts

## Types of Fixed Chunking in LangChain

LangChain provides two main approaches to fixed-size text splitting:

### 1. CharacterTextSplitter
- Splits text on a single separator (e.g., space `" "` or newline `"\n"`; the default is `"\n\n"`)
- Merges the resulting pieces into chunks of up to `chunk_size` characters
- If a sentence or paragraph is too long, it may split mid-sentence
- **Best for**: Simple and fast splitting, but may not always preserve context cleanly

### 2. RecursiveCharacterTextSplitter
- Tries to split text at natural boundaries: paragraphs → lines → words → characters (the default separators are `"\n\n"`, `"\n"`, `" "`, and `""`)
- Uses a list of separators and recursively falls back if cleaner splits aren't possible
- Produces better-structured chunks with less context loss
- **More advanced**: Preserves semantic meaning better
- **Recommended**: Use `RecursiveCharacterTextSplitter` when you want smart chunking, especially for documents with mixed formatting or longer sentences
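
The fallback idea can be sketched in a few lines of plain Python. This is a simplified illustration, assuming the separator order `"\n\n"`, `"\n"`, `" "`, `""`; the function `recursive_split` is hypothetical, ignores `chunk_overlap`, and is not LangChain's actual implementation.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators before finer ones (illustrative sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits, keep merging
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: fall back to finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


demo_text = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the limit allows.\n\n"
    "End."
)
demo_pieces = recursive_split(demo_text, chunk_size=30)
for piece in demo_pieces:
    print(repr(piece))
```

The short paragraphs survive intact, while the long one is broken at word boundaries because no paragraph or line break inside it was available.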

In [6]:
# Define a function to split text into fixed-size character chunks using LangChain's CharacterTextSplitter.

def fixed_size_chunking(docs, chunk_size=500, chunk_overlap=50):
    
    splitter = CharacterTextSplitter(
        separator=" ",
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    return splitter.split_documents(docs)


# Apply it
fixed_chunks = fixed_size_chunking(docs)
print(f" Total fixed-size chunks: {len(fixed_chunks)}\n")
print(f" Example:First Chunk \n{fixed_chunks[0].page_content[:]}")

 Total fixed-size chunks: 133

 Example:First Chunk 
1
Scientific Reports | (2020) 10:7483 | https://doi.org/10.1038/s41598-020-64454-x
www.nature.com/scientificreports
Inhibitory action of 
phenothiazinium dyes against 
Neospora caninum
Luiz Miguel Pereira1,2, Caroline Martins Mota   3, Luciana Baroni1, 
Cássia Mariana Bronzon da Costa1, Jade Cabestre Venancio Brochi1, Mark Wainwright4, 
Tiago Wilson Patriarca Mineo   3, Gilberto Úbida Leite Braga1 & Ana Patrícia Yatsuda1,2 ✉
Neospora caninum is an Apicomplexan parasite related to important


In [7]:
# Splits documents using RecursiveCharacterTextSplitter which preserves context better

def recursive_chunking(docs, chunk_size=500, chunk_overlap=50):
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    
    return splitter.split_documents(docs)


# Apply it
recursive_chunks = recursive_chunking(docs)
print(f" Total recursive chunks: {len(recursive_chunks)}\n")
print(f" Example: First Chunk \n{recursive_chunks[0].page_content[:]}")

 Total recursive chunks: 135

 Example: First Chunk 
1
Scientific Reports |         (2020) 10:7483  | https://doi.org/10.1038/s41598-020-64454-x
www.nature.com/scientificreports
Inhibitory action of 
phenothiazinium dyes against 
Neospora caninum
Luiz Miguel Pereira1,2, Caroline Martins Mota   3, Luciana Baroni1,  
Cássia Mariana Bronzon da Costa1, Jade Cabestre Venancio Brochi1, Mark Wainwright4, 
Tiago Wilson Patriarca Mineo   3, Gilberto Úbida Leite Braga1 & Ana Patrícia Yatsuda1,2 ✉


## 🔍 Observation: Fixed-Size Chunking

### What We've Learned:

1. **CharacterTextSplitter**:
   - The `fixed_size_chunking()` function splits text based on a specific number of characters (e.g., 500)
   - Simple and fast implementation
   - **Limitation**: Does not respect sentence boundaries
   - This means a chunk might start or end mid-sentence, which can confuse the language model

2. **The Role of `chunk_overlap`**:
   - Adding `chunk_overlap` (e.g., 50 characters) helps keep context between chunks
   - Up to the last 50 characters of one chunk appear at the start of the next
   - This reduces information loss at chunk boundaries

3. **RecursiveCharacterTextSplitter**:
   - More intelligent than simple character splitting
   - Attempts to split at natural boundaries (paragraphs, then lines, then words)
   - Produces cleaner, more contextually coherent chunks

### When to Use Fixed-Size Chunking:
- Quick setups or prototypes
- Well-structured data where sentence cuts aren't a big issue
- When processing speed is more important than perfect context preservation

---

# Method 2: Sentence-Based Chunking

## Understanding Sentence-Based Chunking

Instead of splitting by character count, sentence-based chunking splits text into groups of complete sentences. This approach is more readable and retains better semantic meaning.

### Key Parameter:

#### `sentences_per_chunk`
- Defines how many sentences to include in each chunk
- **Example**: `sentences_per_chunk=3` will group 3 sentences together as one chunk
- Each chunk is a complete thought or mini-paragraph

### Why Use Sentence-Based Chunking?

**Advantages:**
- ✅ Keeps chunks meaningful and readable
- ✅ Respects natural language flow
- ✅ Makes each chunk easier for the model to understand
- ✅ Especially helpful when documents have clear sentence structures

**Best For:**
- Research papers
- Articles
- Long-form text where sentence flow matters

### How It Works:
1. Use NLTK's `sent_tokenize()` to split text into sentences
2. Group sentences into chunks of N sentences
3. Each chunk contains complete, grammatically correct sentences
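
The grouping step is easy to see on a toy example. The sketch below uses a rough regular-expression splitter as a stand-in for `sent_tokenize` so that it runs without NLTK data; both helper names are hypothetical, and the regex is far less accurate than NLTK on abbreviations or decimals.

```python
import re


def naive_sentence_split(text):
    """Rough stand-in for sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def group_sentences(sentences, sentences_per_chunk=3):
    """Join every N consecutive sentences into one chunk; the last chunk may be shorter."""
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


toy_text = (
    "Chunking splits documents. Each chunk is embedded. "
    "Retrieval finds similar chunks. The model reads them. It then answers."
)
toy_sentences = naive_sentence_split(toy_text)
toy_groups = group_sentences(toy_sentences, sentences_per_chunk=2)
for group in toy_groups:
    print(group)
```

Five sentences grouped two at a time give three chunks, and the final chunk holds the single leftover sentence.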

In [8]:
# Define a function to split each page into chunks of N sentences.

def sentence_based_chunking(docs, sentences_per_chunk=3):
    
    chunks = []
    
    for doc in docs:
        sentences = sent_tokenize(doc.page_content)
        for i in range(0, len(sentences), sentences_per_chunk):
            chunk_text = " ".join(sentences[i:i + sentences_per_chunk])
            chunks.append(chunk_text)
    
    return chunks


sentence_chunks = sentence_based_chunking(docs)
print(f" Total sentence-based chunks: {len(sentence_chunks)}\n")
print(f" Example:\n{sentence_chunks[0][:]}")

 Total sentence-based chunks: 240

 Example:
1
Scientific Reports |         (2020) 10:7483  | https://doi.org/10.1038/s41598-020-64454-x
www.nature.com/scientificreports
Inhibitory action of 
phenothiazinium dyes against 
Neospora caninum
Luiz Miguel Pereira1,2, Caroline Martins Mota   3, Luciana Baroni1,  
Cássia Mariana Bronzon da Costa1, Jade Cabestre Venancio Brochi1, Mark Wainwright4, 
Tiago Wilson Patriarca Mineo   3, Gilberto Úbida Leite Braga1 & Ana Patrícia Yatsuda1,2 ✉
Neospora caninum is an Apicomplexan parasite related to important losses in livestock, causing 
abortions and decreased fertility in affected cows. Several chemotherapeutic strategies have been 
developed for disease control; however, no commercial treatment is available. Among the candidate 
drugs against neosporosis, phenothiazinium dyes, offer a low cost-efficient approach to parasite 
control.


## 🔍 Observation: Sentence-Based Chunking

### What We've Learned:

1. **Natural Language Flow**:
   - The `sentence_based_chunking()` function breaks text into chunks of a fixed number of sentences (e.g., 3)
   - This method respects natural language flow
   - Each chunk is easier for the model to understand because it contains complete thoughts

2. **Context Preservation**:
   - Unlike character-based chunking, sentences are not cut mid-way (except where a sentence spans a page break, because each page is tokenized separately)
   - Each chunk represents a coherent unit of meaning
   - Better for downstream tasks like question answering and summarization

3. **Especially Helpful For**:
   - Documents with clean sentence structures
   - Academic papers and articles
   - Content where maintaining grammatical integrity is important

### Best Use Cases:
- Research papers, articles, and long-form text where sentence flow matters
- When you need chunks that are semantically complete
- Documents with well-formed sentences and paragraphs

---

# Method 3: Semantic Chunking

## Understanding Semantic Chunking

Semantic chunking, as implemented in this lab, splits content at paragraph breaks or other logical boundaries, so each chunk keeps one idea or concept intact.

### How It Works:

Instead of counting characters or sentences, semantic chunking looks for natural breaks in the document:
- **Paragraph boundaries** (indicated by `\n\n` - two newlines)
- **Section breaks**
- **Logical topic changes**

### Why Use Semantic Chunking?

**Advantages:**
- ✅ Preserves complete ideas and concepts
- ✅ Each chunk represents a logical unit of information
- ✅ Best for documents with clear structural formatting
- ✅ Reduces context fragmentation
- ✅ Ideal for embedding generation and semantic search

**Best For:**
- Structured PDFs (like research papers with clear paragraphs)
- Reports with proper formatting
- Documents where each paragraph discusses a distinct topic

### Important Note:
This method works best when the original text is **well-formatted with clear paragraph structure**. If the document has inconsistent formatting, you might get uneven chunk sizes.
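
A quick way to see how sensitive paragraph splitting is to formatting is to try it on small strings. The sketch below is illustrative only: `split_paragraphs` is a hypothetical helper that uses a regular expression to tolerate whitespace on the "blank" line, which plain `split("\n\n")` does not.

```python
import re


def split_paragraphs(text):
    """Split on blank lines, tolerating spaces or tabs on the 'empty' line."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


well_formatted = "Intro paragraph.\n\nMethods paragraph.\n \nResults paragraph."
single_newlines = "Intro paragraph.\nMethods paragraph.\nResults paragraph."

print(len(split_paragraphs(well_formatted)))   # regex split finds all three paragraphs
print(len(well_formatted.split("\n\n")))       # plain split misses the "\n \n" break
print(len(split_paragraphs(single_newlines)))  # no blank lines: one big chunk
```

If extracted PDF text uses only single newlines, any blank-line based splitter returns the whole page as one chunk, so it is worth checking the raw text before relying on this method.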

In [9]:
# Define a function to split based on paragraph breaks (using two newlines)

def semantic_chunking(docs):
    
    chunks = []
    for doc in docs:
        paragraphs = doc.page_content.split("\n\n")
        for para in paragraphs:
            cleaned = para.strip()
            if cleaned:
                chunks.append(cleaned)
    return chunks


semantic_chunks = semantic_chunking(docs)
print(f" Total semantic chunks: {len(semantic_chunks)}\n")
print(f" Example:\n{semantic_chunks[0][:]}")

 Total semantic chunks: 13

 Example:
1
Scientific Reports |         (2020) 10:7483  | https://doi.org/10.1038/s41598-020-64454-x
www.nature.com/scientificreports
Inhibitory action of 
phenothiazinium dyes against 
Neospora caninum
Luiz Miguel Pereira1,2, Caroline Martins Mota   3, Luciana Baroni1,  
Cássia Mariana Bronzon da Costa1, Jade Cabestre Venancio Brochi1, Mark Wainwright4, 
Tiago Wilson Patriarca Mineo   3, Gilberto Úbida Leite Braga1 & Ana Patrícia Yatsuda1,2 ✉
Neospora caninum is an Apicomplexan parasite related to important losses in livestock, causing 
abortions and decreased fertility in affected cows. Several chemotherapeutic strategies have been 
developed for disease control; however, no commercial treatment is available. Among the candidate 
drugs against neosporosis, phenothiazinium dyes, offer a low cost-efficient approach to parasite 
control. We report the anti-parasitic effects of the phenothiaziums Methylene Blue (MB), New 
Methylene Blue (NMB), 1,9–Dimethyl Me

## 🔍 Observation: Semantic Chunking

### What We've Learned:

1. **Logical Boundaries**:
   - Semantic chunking splits text at paragraph boundaries (double newlines `\n\n`)
   - Each chunk represents a complete idea or concept
   - This is the most "natural" way to split documents

2. **Context Preservation**:
   - Maintains the semantic integrity of each section
   - Each chunk is a self-contained unit of meaning
   - Minimal loss of context between chunks

3. **When It Works Best**:
   - Great when the original text is well-formatted with clear paragraph structure
   - Ideal for structured PDFs or reports with proper formatting
   - Perfect for research papers where each paragraph discusses a distinct topic

4. **Limitations**:
   - Requires documents to have consistent formatting
   - May produce variable-sized chunks (some paragraphs are longer than others)
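   - Not ideal for poorly formatted documents or for extracted text without blank lines: the run above returned 13 chunks from 13 loaded pages, which suggests the PDF text contained no `\n\n` breaks, so each page became a single chunk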

### Best Use Cases:
- Structured PDFs or reports with proper formatting (like research papers)
- Documents where maintaining topic coherence is critical
- When you need each chunk to represent a complete concept or idea

---

# Comparing All Methods

## Side-by-Side Comparison

Now let's compare how each method chunks the same document. We'll look at the chunk at index 6 to see the differences clearly.

In [10]:
fixed_chunks = fixed_size_chunking(docs)
#print(f" Total fixed-size chunks: {len(fixed_chunks)}")
print(f" Fixed-size chunk at index 6: \n{fixed_chunks[6].page_content[:]}\n")

recursive_chunks = recursive_chunking(docs)
#print(f" Total recursive chunks: {len(recursive_chunks)}")
print(f" Recursive chunk at index 6: \n{recursive_chunks[6].page_content[:]}\n")

sentence_chunks = sentence_based_chunking(docs)
#print(f" Total sentence-based chunks: {len(sentence_chunks)}")
print(f" Sentence-based chunk at index 6:\n{sentence_chunks[6][:]}\n")

semantic_chunks = semantic_chunking(docs)
#print(f" Total semantic chunks: {len(semantic_chunks)}")
print(f" Semantic chunk at index 6:\n{semantic_chunks[6][:]}")

 Fixed-size chunk at index 6: 
against Plasmodium spp, the etiologic agent of malaria, 
our group determined the efficacy of this molecule on N. caninum, either alone or combined with Pyrimethamine 
(Pyr)27. MB, a phenothiazinium dye, was the first synthetic drug described to cure a patient affected by malaria in 
the XIX century, pioneering work performed by Paul Ehrlich28,29. Indeed, MB was used against Plasmodium until 
the use of Chloroquine and other drugs (massively utilized after the Second World War), which lack some

 Recursive chunk at index 6: 
our group determined the efficacy of this molecule on N. caninum, either alone or combined with Pyrimethamine 
(Pyr)27. MB, a phenothiazinium dye, was the first synthetic drug described to cure a patient affected by malaria in 
the XIX century, pioneering work performed by Paul Ehrlich28,29. Indeed, MB was used against Plasmodium until 
the use of Chloroquine and other drugs (massively utilized after the Second World War), which lack 

## 📊 Key Observations

### What You'll Notice:

If we look closely, we'll see that **the chunk at index 0 starts with the same content across all the methods** (Fixed Chunking, Recursive Chunking, Sentence-Based Chunking, and Semantic Chunking): the title and author block. The chunks differ only in where they end.

However, **when we inspect the chunk at index 6, the differences among the methods become clear**:

1. **Fixed-Size Chunking (CharacterTextSplitter)**:
   - May cut sentences mid-way
   - Less readable and may lose context
   - Fastest but least intelligent

2. **Recursive Chunking (RecursiveCharacterTextSplitter)**:
   - Tries to preserve natural boundaries
   - Better context preservation than fixed-size
   - Good balance of speed and quality

3. **Sentence-Based Chunking**:
   - Always contains complete sentences
   - More readable and coherent
   - Better for maintaining grammatical structure

4. **Semantic Chunking**:
   - Contains complete paragraphs or logical sections
   - Best preserves the author's intended structure
   - Most contextually coherent
   - Ideal for downstream tasks like embedding generation

### The Takeaway:
While all methods may produce similar results for early chunks, their **structural differences become apparent** as you go deeper into the document. Choose your chunking method based on:
- Your document's structure
- Your downstream task requirements
- The balance between speed and quality you need
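
Beyond reading individual chunks, simple length statistics can make the comparison more objective. The helper below is a hypothetical sketch; it accepts either plain strings (as returned by the sentence-based and paragraph-based functions) or objects with a `page_content` attribute (as returned by `split_documents`).

```python
def chunk_stats(chunks):
    """Return count and min/mean/max character length for a list of chunks."""
    # Documents expose their text as .page_content; plain strings are used as-is.
    texts = [getattr(c, "page_content", c) for c in chunks]
    lengths = [len(t) for t in texts]
    if not lengths:
        return {"count": 0, "min": 0, "mean": 0.0, "max": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "mean": round(sum(lengths) / len(lengths), 1),
        "max": max(lengths),
    }


demo_stats = chunk_stats(["short", "a somewhat longer chunk", "mid-size chunk"])
print(demo_stats)
```

In the notebook, calling `chunk_stats(fixed_chunks)`, `chunk_stats(recursive_chunks)`, `chunk_stats(sentence_chunks)`, and `chunk_stats(semantic_chunks)` would show how evenly each method sizes its chunks.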

---

# 🎯 Hands-On Activity

## Your Task:

Given a healthcare-related article or PDF, use both `CharacterTextSplitter` and `RecursiveCharacterTextSplitter` from LangChain to chunk the text.

**Requirements:**
1. Allow the values for `chunk_size` and `chunk_overlap` to be set dynamically by the user
2. Implement both chunking methods (CharacterTextSplitter and RecursiveCharacterTextSplitter) using the user-defined chunk_size and chunk_overlap
3. Print and compare the first 3 chunks generated by each method
4. Reflect on the differences in structure, readability, and context continuity between the two methods

---

## 🤔 Reflection Question:

**Which chunking method better preserves the original structure and meaning of the content — and why might that be important for downstream tasks like embeddings, semantic retrieval, or LLM-based answer generation?**

Consider:
- How each method handles sentence boundaries
- The impact on semantic coherence
- Trade-offs between simplicity and context preservation
- Which method would work best for your specific use case

---

## 💡 Tips for Your Implementation:

1. Start with reasonable values like `chunk_size=500` and `chunk_overlap=50`
2. Experiment with different values to see how they affect the output
3. Pay attention to how sentences are split (or not split) in each method
4. Consider the readability of each chunk

Try it out in the cell below!

In [None]:
# Your code here!
# TODO: Implement both CharacterTextSplitter and RecursiveCharacterTextSplitter
# TODO: Compare the first 3 chunks from each method
# TODO: Write your observations and reflection



---

# 📚 Summary and Best Practices

## Chunking Strategy Cheat Sheet:

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Fixed-Size (CharacterTextSplitter)** | Quick prototypes, well-structured data | Fast, predictable | May cut sentences |
| **Recursive (RecursiveCharacterTextSplitter)** | Most general use cases | Smart splitting, good balance | Slightly slower |
| **Sentence-Based** | Clean text with clear sentences | Readable, grammatically correct | Requires sentence detection |
| **Semantic (Paragraph-Based)** | Well-formatted documents | Preserves complete ideas | Needs structured input |

## Key Takeaways:

1. **Start with RecursiveCharacterTextSplitter** - It's a safe default for most use cases
2. **Use chunk_overlap** - Helps maintain context across boundaries (typically 10-20% of chunk_size)
3. **Consider your downstream task** - Embedding models and retrieval systems benefit from coherent chunks
4. **Test and iterate** - What works for one document type may not work for another
5. **Balance chunk size** - Too small loses context; too large dilutes retrieval precision and may exceed token limits

## Next Steps:

- Experiment with different chunk sizes for your specific use case
- Try combining methods (e.g., semantic chunking followed by size-based splitting)
- Consider using more advanced techniques like embedding-based semantic chunking
- Test how different chunking strategies affect your RAG system's performance
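
As one possible starting point for combining methods, the sketch below splits on paragraph breaks first and then caps oversized paragraphs at word boundaries. The function `paragraph_then_size` and its `max_chars` parameter are hypothetical names chosen for illustration; a real pipeline might instead pass paragraph chunks through `RecursiveCharacterTextSplitter`.

```python
def paragraph_then_size(text, max_chars=500):
    """Split on blank lines, then cut any oversized paragraph at word boundaries.

    A single word longer than max_chars is kept whole rather than cut.
    """
    chunks = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            chunks.append(para)  # paragraph already fits
            continue
        current = ""
        for word in para.split():
            candidate = word if not current else current + " " + word
            if len(candidate) <= max_chars or not current:
                current = candidate
            else:
                chunks.append(current)
                current = word
        if current:
            chunks.append(current)
    return chunks


combo_text = "Short paragraph.\n\n" + "word " * 10
combo_chunks = paragraph_then_size(combo_text, max_chars=20)
for chunk in combo_chunks:
    print(repr(chunk))
```

Short paragraphs pass through unchanged, so the author's structure is preserved wherever it already fits the size budget.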

Happy chunking! 🚀