In [2]:
!pip install sentence-transformers
!pip install langchain-community
!pip install pypdf
!pip install langchain-experimental

In [None]:
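import sys
import os

# Add the parent directory to the import path so the local `Modules` package resolves.
sys.path.append(os.path.abspath(".."))

# Document lives in langchain_core; the older `langchain.schema` path is a legacy alias.
from langchain_core.documents import Document
from Modules import scipy_chunking_helper as chunking_helper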

# Chunking Methods for RAG Pipelines
---
This notebook demonstrates several methods to split text documents into useful chunks for retrieval-augmented generation (RAG) using LangChain and related libraries.

## 1. Sample Text

In [5]:
sample_text = """# Introduction
Welcome to our demo on text chunking! This section introduces chunking methods and explains why splitting long documents into smaller pieces is critical for efficient retrieval and generation.

## What is Chunking?
Chunking is the process of dividing text into manageable segments.
- Character-based chunking splits purely by length.
- Recursive chunking uses natural text breaks, like paragraphs or sentences.
- Semantic chunking finds topic changes.

Here is a list:
1. Fast and easy: character-based.
2. Natural: recursive or markdown-based.
3. Context-aware: semantic.

## An Example with a Superlongword
Sometimes, data includes strange artifacts like:
ThisIsASingleUnbreakableSupercalifragilisticexpialidociousWordThatExceedsTheChunkSizeLimitAndCausesTroubleForSplitters.

## Topic Change: Semantic Matters
Chunkers that consider **meaning** will split here, as the topic shifts from chunking methods to why semantics matter.
Semantic chunking is especially useful when there are clear boundaries in ideas or narrative flow, even if there's no line break or heading.

In summary, choose your chunker based on your data and task!"""
doc = Document(page_content=sample_text)

## 2. Different Chunking Methods
### 2.1 Character Splitter
Splits text into fixed-length character chunks (with optional overlap).
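
As a rough illustration of the idea (not the implementation behind `chunking_helper.fixed_length_chunking`, which is not shown in this notebook), fixed-length splitting with overlap can be sketched in plain Python. The function name `fixed_length_chunks` and its parameters are hypothetical, and unlike the helper's output the sketch does not trim whitespace.

```python
def fixed_length_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - overlap) characters after the previous one,
    so consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(fixed_length_chunks("abcdefghij", chunk_size=4))
print(fixed_length_chunks("abcdefghij", chunk_size=4, overlap=2))
```

Because the cut positions depend only on character counts, words and headings can be split mid-token.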

In [19]:
char_chunks = chunking_helper.fixed_length_chunking(doc, 200, 0, '')
print(f"Total chunks: {len(char_chunks)}")

Total chunks: 6


In [10]:
chunking_helper.print_chunks(char_chunks)

Chunk 1 (length: 200):
# Introduction
Welcome to our demo on text chunking! This section introduces chunking methods and explains why splitting long documents into smaller pieces is critical for efficient retrieval and gene
----------------------------------------
Chunk 2 (length: 200):
ration.

## What is Chunking?
Chunking is the process of dividing text into manageable segments.
- Character-based chunking splits purely by length.
- Recursive chunking uses natural text breaks, like
----------------------------------------
Chunk 3 (length: 199):
paragraphs or sentences.
- Semantic chunking finds topic changes.

Here is a list:
1. Fast and easy: character-based.
2. Natural: recursive or markdown-based.
3. Context-aware: semantic.

## An Examp
----------------------------------------
Chunk 4 (length: 200):
le with a Superlongword
Sometimes, data includes strange artifacts like:
ThisIsASingleUnbreakableSupercalifragilisticexpialidociousWordThatExceedsTheChunkSizeLimitAndCausesTroubleForS

### 2.2 Recursive Splitter
Tries a sequence of logical separators (e.g., paragraph breaks, then sentence or line breaks), falling back to finer ones only when a piece still exceeds the chunk size, to produce more "natural" chunks.
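
The following is a minimal sketch of the recursive idea in plain Python, assuming a separator order of paragraph break, line break, space, and finally a hard character cut. The function `recursive_chunks` and its default separators are illustrative assumptions, not the code behind `chunking_helper.recursive_chunking`.

```python
def recursive_chunks(text: str, chunk_size: int,
                     separators: tuple = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # No separators left (or the empty separator): fall back to a hard character cut.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_chunks(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep packing adjacent pieces while they still fit.
            current = candidate
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks


print(recursive_chunks("aaa bbb\n\nccc ddd eee\n\nfff", chunk_size=8))
print(recursive_chunks("abcdefghij", chunk_size=4))
```

A single word longer than the chunk size contains none of the earlier separators, so only the final hard cut can break it up.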

In [18]:
recursive_chunks = chunking_helper.recursive_chunking(doc, 200, 0)
print(f"Total chunks: {len(recursive_chunks)}")

Total chunks: 10


In [12]:
chunking_helper.print_chunks(recursive_chunks)

Chunk 1 (length: 14):
# Introduction
----------------------------------------
Chunk 2 (length: 192):
Welcome to our demo on text chunking! This section introduces chunking methods and explains why splitting long documents into smaller pieces is critical for efficient retrieval and generation.
----------------------------------------
Chunk 3 (length: 139):
## What is Chunking?
Chunking is the process of dividing text into manageable segments.
- Character-based chunking splits purely by length.
----------------------------------------
Chunk 4 (length: 117):
- Recursive chunking uses natural text breaks, like paragraphs or sentences.
- Semantic chunking finds topic changes.
----------------------------------------
Chunk 5 (length: 119):
Here is a list:
1. Fast and easy: character-based.
2. Natural: recursive or markdown-based.
3. Context-aware: semantic.
----------------------------------------
Chunk 6 (length: 83):
## An Example with a Superlongword
Sometimes, data includes strange artif

### 2.3 Semantic Chunking
Chunks text by **semantic similarity**, splitting where the topic shifts rather than at fixed lengths.  
We'll use `sentence-transformers` to embed sentences and split at points with high semantic "distance".
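
To make the mechanism concrete, here is a small self-contained sketch. It assumes that the `"percentile"` and `70` arguments used in this notebook mean "split wherever the distance between neighbouring sentences exceeds the 70th percentile of all such distances"; that reading is an assumption about the helper, not something the notebook confirms. The `toy_embed` function is a hypothetical bag-of-words stand-in for a real sentence-embedding model, used only so the example runs without downloads.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding (assumption): word counts instead of a neural sentence vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values: list, pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(text: str, pct: float = 70, embed=toy_embed) -> list[str]:
    # Sentence boundaries: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences

    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)

    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            # Large jump in meaning between neighbours: close the current chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


text = "Cats purr softly. Cats purr loudly. Stocks fell sharply. Stocks fell again."
print(semantic_chunks(text))
```

A punctuation-based sentence splitter like the one in this sketch also treats list markers such as `1.` as sentence ends, so numbered lists can be cut in unexpected places.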

In [23]:
semantic_chunks = chunking_helper.semantic_chunking(doc, "sentence-transformers/all-mpnet-base-v2", "percentile", 70)
print(f"Total chunks: {len(semantic_chunks)}")

Total chunks: 6


In [24]:
chunking_helper.print_chunks(semantic_chunks)

Chunk 1 (length: 465):
# Introduction
Welcome to our demo on text chunking! This section introduces chunking methods and explains why splitting long documents into smaller pieces is critical for efficient retrieval and generation. ## What is Chunking? Chunking is the process of dividing text into manageable segments. - Character-based chunking splits purely by length. - Recursive chunking uses natural text breaks, like paragraphs or sentences. - Semantic chunking finds topic changes.
----------------------------------------
Chunk 2 (length: 18):
Here is a list:
1.
----------------------------------------
Chunk 3 (length: 31):
Fast and easy: character-based.
----------------------------------------
Chunk 4 (length: 43):
2. Natural: recursive or markdown-based. 3.
----------------------------------------
Chunk 5 (length: 24):
Context-aware: semantic.
----------------------------------------
Chunk 6 (length: 558):
## An Example with a Superlongword
Sometimes, data includes strange artifac

### 2.4 Semantic Chunking with Controlled Chunk Size
Uses the same logic as above, with two additions:


*   Merge small chunks into the previous one
*   Enforce a maximum chunk size
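
The two adjustments above can be sketched as a post-processing step over any list of chunk strings. This is only an illustration under assumed names and thresholds (`control_chunk_sizes`, `min_size`, `max_size` are hypothetical); how `semantic_chunking_improved` actually merges and caps chunks is not shown in this notebook.

```python
def control_chunk_sizes(chunks: list[str], min_size: int, max_size: int) -> list[str]:
    """Merge chunks shorter than min_size into their predecessor, then cap length at max_size."""
    merged = []
    for chunk in chunks:
        if merged and len(chunk) < min_size:
            # Too small to stand alone: append it to the previous chunk.
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)

    sized = []
    for chunk in merged:
        # Anything still longer than max_size is cut into max_size windows.
        for start in range(0, len(chunk), max_size):
            sized.append(chunk[start:start + max_size])
    return sized


print(control_chunk_sizes(["aaaa", "b", "cccccc"], min_size=3, max_size=10))
print(control_chunk_sizes(["aaaa", "b", "cccccc"], min_size=3, max_size=5))
```

The hard cut in the second pass is the simplest possible cap and can itself leave short fragments; a more careful version might re-split oversized chunks at sentence boundaries instead.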



In [16]:
semantic_chunks_2 = chunking_helper.semantic_chunking_improved(doc, "sentence-transformers/all-mpnet-base-v2", "percentile", 70)
print(f"Total chunks: {len(semantic_chunks_2)}")

Total chunks: 5


In [20]:
chunking_helper.print_chunks(semantic_chunks_2)

Chunk 1 (length: 52):
# Introduction
Welcome to our demo on text chunking!
----------------------------------------
Chunk 2 (length: 532):
This section introduces chunking methods and explains why splitting long documents into smaller pieces is critical for efficient retrieval and generation. ## What is Chunking? Chunking is the process of dividing text into manageable segments. - Character-based chunking splits purely by length. - Recursive chunking uses natural text breaks, like paragraphs or sentences. - Semantic chunking finds topic changes. Here is a list:
1. Fast and easy: character-based. 2. Natural: recursive or markdown-based. 3. Context-aware: semantic.
----------------------------------------
Chunk 3 (length: 203):
## An Example with a Superlongword
Sometimes, data includes strange artifacts like:
ThisIsASingleUnbreakableSupercalifragilisticexpialidociousWordThatExceedsTheChunkSizeLimitAndCausesTroubleForSplitters.
----------------------------------------
Chunk 4 (length: 15