# Chunking strategies and search strategies for vector embeddings


**Levels Of Text Splitting**
* **Level 1: Character Splitting** - Simple static character chunks of data
* **Level 2: Recursive Character Text Splitting** - Recursive chunking based on a list of separators
* **Level 3: Document Specific Splitting** - Various chunking methods for different document types (PDF, Python, Markdown)
* **Level 4: Semantic Splitting** - Embedding based chunking



Ref: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb

## Level 1: Character Splitting
Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.


* **Chunk Size** - The number of characters you would like in your chunks. 50, 100, 100,000, etc.
* **Chunk Overlap** - The amount you would like your sequential chunks to overlap. This is to try to avoid cutting a single piece of context into multiple pieces. This will create duplicate data across chunks.
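
To make the two parameters concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is not LangChain's implementation, and its output with overlap will differ from the library's; the function name `split_by_characters` is hypothetical, and the sketch assumes `chunk_overlap` is smaller than `chunk_size`.

```python
def split_by_characters(text, chunk_size, chunk_overlap=0):
    """Cut text into chunk_size-character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk starts after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


print(split_by_characters("abcdefghij", 4))     # ['abcd', 'efgh', 'ij']
print(split_by_characters("abcdefghij", 4, 2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each new chunk starts `chunk_size - chunk_overlap` characters after the previous one, which is why a larger overlap produces more chunks for the same text.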


In [None]:
text = "GenAI learns from existing data and uses it to generate new data with similar characteristics. For example, it can create text, images, videos, sounds, code, and 3D designs."

In [None]:
from langchain.text_splitter import CharacterTextSplitter

In [None]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='', strip_whitespace=False)

Then we can actually split our text via `create_documents`. Note: `create_documents` expects a list of texts, so if you just have a string (like we do) you'll need to wrap it in `[]`

In [None]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='GenAI learns from existing data and'),
 Document(metadata={}, page_content=' uses it to generate new data with '),
 Document(metadata={}, page_content='similar characteristics. For exampl'),
 Document(metadata={}, page_content='e, it can create text, images, vide'),
 Document(metadata={}, page_content='os, sounds, code, and 3D designs.')]

In the output above, some `page_content` values begin or end with whitespace. We can remove it by setting `strip_whitespace=True`:

In [None]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='', strip_whitespace=True)

In [None]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='GenAI learns from existing data and'),
 Document(metadata={}, page_content='uses it to generate new data with'),
 Document(metadata={}, page_content='similar characteristics. For exampl'),
 Document(metadata={}, page_content='e, it can create text, images, vide'),
 Document(metadata={}, page_content='os, sounds, code, and 3D designs.')]

**Chunk Overlap**

**Chunk overlap** makes consecutive chunks share text, so that the tail of Chunk #1 is repeated at the head of Chunk #2, and so on.

In [None]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=4, separator='')

In [None]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='GenAI learns from existing data and'),
 Document(metadata={}, page_content='and uses it to generate new data w'),
 Document(metadata={}, page_content='ta with similar characteristics. Fo'),
 Document(metadata={}, page_content='. For example, it can create text,'),
 Document(metadata={}, page_content='xt, images, videos, sounds, code, a'),
 Document(metadata={}, page_content='e, and 3D designs.')]

Notice that the chunks are still capped at 35 characters, but now the end of chunk 1 reappears at the start of chunk 2, the end of chunk 2 at the start of chunk 3, and so on.

**Separators**

**Separators** are the character sequences that we want our text to split on. The splitter also removes that sequence from the text.


Let's split on `da`:

In [None]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='da')

In [None]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='GenAI learns from existing'),
 Document(metadata={}, page_content='ta and uses it to generate new'),
 Document(metadata={}, page_content='ta with similar characteristics. For example, it can create text, images, videos, sounds, code, and 3D designs.')]
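
The last chunk above is much longer than `chunk_size=35`: once the text has been cut on `da`, there is no smaller separator to fall back on, so an oversized piece is kept whole. A simplified sketch of this split-then-merge behaviour, ignoring overlap, might look like the following. The function `split_on_separator` is hypothetical and only approximates what `CharacterTextSplitter` does.

```python
def split_on_separator(text, separator, chunk_size):
    """Split on a separator, then greedily merge pieces while they fit in chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + separator + piece if current else piece
        if current and len(candidate) > chunk_size:
            chunks.append(current.strip())  # emit what we have and start a new chunk
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current.strip())  # a single oversized piece is emitted as-is
    return chunks


text = ("GenAI learns from existing data and uses it to generate new data with "
        "similar characteristics. For example, it can create text, images, videos, "
        "sounds, code, and 3D designs.")
for chunk in split_on_separator(text, "da", 35):
    print(len(chunk), repr(chunk))
```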

## Level 2: Recursive Character Text Splitting

With the Recursive Character Text Splitter, we specify a series of separators that are tried in order to split our docs.

The default ones are:
* "\n\n" - Double new line, or most commonly paragraph breaks
* "\n" - New lines
* " " - Spaces
* "" - Characters



In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
text = """
Generative AI (GenAI) is a type of artificial intelligence (AI) that can create new content and ideas based on existing data:

How it works

GenAI learns from existing data and uses it to generate new data with similar characteristics. For example, it can create text, images, videos, sounds, code, and 3D designs.

How it's used

GenAI is used in many industries, including software development, healthcare, finance, entertainment, and customer service.
Examples
"""

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap=0)

In [None]:
text_splitter.create_documents([text])

[Document(metadata={}, page_content='Generative AI (GenAI) is a type of artificial intelligence (AI)'),
 Document(metadata={}, page_content='that can create new content and ideas based on existing data:'),
 Document(metadata={}, page_content='How it works'),
 Document(metadata={}, page_content='GenAI learns from existing data and uses it to generate new data'),
 Document(metadata={}, page_content='with similar characteristics. For example, it can create text,'),
 Document(metadata={}, page_content='images, videos, sounds, code, and 3D designs.'),
 Document(metadata={}, page_content="How it's used"),
 Document(metadata={}, page_content='GenAI is used in many industries, including software'),
 Document(metadata={}, page_content='development, healthcare, finance, entertainment, and customer'),
 Document(metadata={}, page_content='service.'),
 Document(metadata={}, page_content='Examples')]

The splitter first looks for double new lines (paragraph breaks).

Once the paragraphs are split, it checks the chunk size: if a chunk is too big, it splits that chunk by the next separator. If the chunk is still too big, it moves on to the next separator, and so forth.
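
The recursion described here can be sketched in plain Python. This is a simplified illustration, not LangChain's code: the function `recursive_split` is hypothetical, it ignores overlap, and it drops the separator at chunk boundaries.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) <= chunk_size:
                current = piece
            else:
                # The piece alone is too big: retry with the next, finer separator.
                chunks.extend(recursive_split(piece, chunk_size, finer))
                current = ""
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


sample = (
    "Generative AI (GenAI) is a type of artificial intelligence (AI) "
    "that can create new content and ideas based on existing data:\n\n"
    "How it works\n\n"
    "GenAI learns from existing data and uses it to generate new data "
    "with similar characteristics."
)
chunks = recursive_split(sample, 65)
for chunk in chunks:
    print(repr(chunk))
```

Paragraphs that already fit are kept whole, while longer ones fall through to the space separator and are packed word by word.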

Let's split with a larger chunk size.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 200, chunk_overlap=0)
text_splitter.create_documents([text])

[Document(metadata={}, page_content='Generative AI (GenAI) is a type of artificial intelligence (AI) that can create new content and ideas based on existing data: \n\nHow it works'),
 Document(metadata={}, page_content="GenAI learns from existing data and uses it to generate new data with similar characteristics. For example, it can create text, images, videos, sounds, code, and 3D designs. \n\nHow it's used"),
 Document(metadata={}, page_content='GenAI is used in many industries, including software development, healthcare, finance, entertainment, and customer service. \nExamples')]

## Level 3: Document Specific Splitting

The first two levels treat everything as plain prose, so they work poorly for images, PDFs, or code snippets.

The Markdown, Python, and JS splitters work much like the Recursive Character splitter, but with different separators.


Ref: [here](https://python.langchain.com/docs/how_to/code_splitter/)


### Markdown


Separators:
* `\n#{1,6}` - Split by new lines followed by a header (H1 through H6)
* ```` ```\n ```` - Code blocks
* `\n\\*\\*\\*+\n` - Horizontal Lines
* `\n---+\n` - Horizontal Lines
* `\n___+\n` - Horizontal Lines
* `\n\n` - Double new lines
* `\n` - New line
* `" "` - Spaces
* `""` - Character

In [None]:
from langchain.text_splitter import MarkdownTextSplitter

In [None]:
splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)

In [None]:
markdown_text = """
# **Generative AI (GenAI)**

Generative AI, or **GenAI**, refers to a category of artificial intelligence technologies designed to generate new content. This can include text, images, music, videos, and even code. Unlike traditional AI, which focuses on recognizing patterns or classifying data, GenAI aims to **create** something new by learning from large datasets and then generating novel outputs that resemble the original data.

## **Key Features of Generative AI**

1. **Creative Outputs**:
   GenAI can produce original content such as:
   - Text (e.g., articles, stories)
   - Visual Art (e.g., paintings, digital designs)
   - Music (e.g., original compositions)
   - Code (e.g., software programs)
   - Deepfake technology (e.g., realistic images/videos of people)

2. **Learning from Data**:
   Generative AI models are trained on large datasets and learn patterns, structures, and other underlying features. Once trained, they can generate new content by applying what they've learned.

"""

In [None]:
splitter.create_documents([markdown_text])

[Document(metadata={}, page_content='# **Generative AI (GenAI)**'),
 Document(metadata={}, page_content='Generative AI, or **GenAI**, refers to'),
 Document(metadata={}, page_content='a category of artificial intelligence'),
 Document(metadata={}, page_content='technologies designed to generate new'),
 Document(metadata={}, page_content='content. This can include text, images,'),
 Document(metadata={}, page_content='music, videos, and even code. Unlike'),
 Document(metadata={}, page_content='traditional AI, which focuses on'),
 Document(metadata={}, page_content='recognizing patterns or classifying'),
 Document(metadata={}, page_content='data, GenAI aims to **create**'),
 Document(metadata={}, page_content='something new by learning from large'),
 Document(metadata={}, page_content='datasets and then generating novel'),
 Document(metadata={}, page_content='outputs that resemble the original'),
 Document(metadata={}, page_content='data.'),
 Document(metadata={}, page_content='## **Key F

### Python

Separators:
* `\nclass` - Classes first
* `\ndef` - Functions next
* `\n\tdef` - Indented functions
* `\n\n` - Double New lines
* `\n` - New Lines
* `" "` - Spaces
* `""` - Characters

In [None]:
from langchain.text_splitter import PythonCodeTextSplitter

In [None]:
python_text = """
import random

def tell_joke():
  jokes = ["Why don't skeletons fight each other? They don't have the guts!","Why don't programmers like nature? It has too many bugs."]
  joke = random.choice(jokes)
  return joke

print("Here's a fun joke for you:")
print(tell_joke())
"""

In [None]:
python_splitter = PythonCodeTextSplitter(chunk_size=200, chunk_overlap=0)

In [None]:
python_splitter.create_documents([python_text])

[Document(metadata={}, page_content='import random'),
 Document(metadata={}, page_content='def tell_joke():\n  jokes = ["Why don\'t skeletons fight each other? They don\'t have the guts!","Why don\'t programmers like nature? It has too many bugs."]\n  joke = random.choice(jokes)\n  return joke'),
 Document(metadata={}, page_content='print("Here\'s a fun joke for you:")\nprint(tell_joke())')]

The whole function stays in one document when the chunk size is sufficient; the import and the remaining top-level code end up in separate documents.
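
To see why splitting on `\nclass` and `\ndef` first keeps definitions intact, here is a small plain-Python sketch that uses only those two separators. The helper `split_top_level` is hypothetical and far simpler than `PythonCodeTextSplitter`: it has no chunk size and no fallback separators.

```python
import re


def split_top_level(source):
    """Split Python source before each line that starts a top-level class or def."""
    # The lookahead keeps the "class"/"def" keyword at the start of the next piece.
    parts = re.split(r"\n(?=class |def )", source.strip())
    return [part.strip() for part in parts if part.strip()]


source = '''
import random

def tell_joke():
    return random.choice(["a", "b"])

def main():
    print(tell_joke())
'''

parts = split_top_level(source)
for part in parts:
    print(repr(part))
```

Indented lines never match the pattern, so each function body stays attached to its `def` line.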


### JS

Separators:
* `\nfunction` - Indicates the beginning of a function declaration
* `\nconst` - Used for declaring constant variables
* `\nlet` - Used for declaring block-scoped variables
* `\nvar` - Used for declaring a variable
* `\nclass` - Indicates the start of a class definition
* `\nif` - Indicates the beginning of an if statement
* `\nfor` - Used for for-loops
* `\nwhile` - Used for while-loops
* `\nswitch` - Used for switch statements
* `\ncase` - Used within switch statements
* `\ndefault` - Also used within switch statements
* `\n\n` - Indicates a larger separation in text or code
* `\n` - Separates lines of code or text
* `" "` - Separates words or tokens in the code
* `""` - Makes every character a separate element

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

In [None]:
javascript_text = """
let randomNumber = Math.floor(Math.random() * 100) + 1;
document.body.innerHTML = `<h1>Your random number is: ${randomNumber}</h1>`;

"""

In [None]:
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=65, chunk_overlap=0
)

In [None]:
js_splitter.create_documents([javascript_text])

[Document(metadata={}, page_content='let randomNumber = Math.floor(Math.random() * 100) + 1;'),
 Document(metadata={}, page_content='document.body.innerHTML = `<h1>Your random number is:'),
 Document(metadata={}, page_content='${randomNumber}</h1>`;')]

## Level 4: Semantic Chunking


Semantic chunking involves breaking down text based on its meaning, as captured by embeddings. The idea is to identify points where the meaning of adjacent sentences differs significantly, indicating a transition to a new topic or section.

At a high level, the process begins by splitting the text into individual sentences. These sentences are then grouped into sets of three. For each set, the semantic similarity between consecutive sentences is evaluated using embeddings. If the embedding distance (which reflects how semantically similar two sentences are) exceeds a certain threshold, it signals a "break point" where the text should be split into a new chunk or section. This allows the text to be naturally divided based on shifts in meaning.
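
The core loop can be sketched without any model. In the example below, `toy_embed` is a hypothetical stand-in for a real embedding model (it just counts words), the threshold is picked by hand, and the grouping of sentences into sets of three is left out for brevity, so treat it as an illustration of the idea rather than of `SemanticChunker` itself.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(sentences, threshold):
    """Start a new chunk whenever two consecutive sentences are further apart than threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine_distance(toy_embed(previous), toy_embed(current)) > threshold:
            groups.append([])  # break point: the meaning shifted
        groups[-1].append(current)
    return [" ".join(group) for group in groups]


sentences = [
    "Cats purr when happy.",
    "Happy cats purr loudly.",
    "Stocks fell on Monday.",
    "Stocks fell again on Tuesday.",
]
for chunk in semantic_chunks(sentences, threshold=0.5):
    print(chunk)
```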

In [None]:
# Install Dependencies
!pip install --quiet langchain_experimental langchain_openai langchain-huggingface

In [None]:
with open("semantic_chunking.txt") as f:
    semantic_chunking = f.read()

In [None]:
# Create Text Splitter, for this we need an embedding model
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en",encode_kwargs={'normalize_embeddings':True})



In [None]:
text_splitter = SemanticChunker(embeddings)

In [None]:
docs = text_splitter.create_documents([semantic_chunking])
print(docs[0].page_content)

Practical Vedanta and other lectures
Practical Vedanta: Part I
Practical Vedanta: Part II
Practical Vedanta: Part III
Practical Vedanta: Part IV
The Way To The Realisation Of A Universal Religion
The Ideal Of A Universal Religion
The Open Secret
The Way To Blessedness
Yajnavalkya And Maitreyi
Soul, Nature, And God
Cosmology
A Study Of The Sankhya Philosophy
Sankhya And Vedanta
The Goal

Practical Vedanta: Part I
(Delivered in London, 10th November 1896)
I have been asked to say something about the practical position of the
Vedanta philosophy. As I have told you, theory is very good indeed, but how
are we to carry it into practice? If it be absolutely impracticable, no theory is
of any value whatever, except as intellectual gymnastics. The Vedanta,
therefore, as a religion must be intensely practical. We must be able to carry it
out in every part of our lives. And not only this, the fictitious differentiation
between religion and the life of the world must vanish, for the Vedanta teach

In [None]:
print(len(docs))

192


### Breakpoints

This chunker works by determining when to "break" apart sentences. This is done by looking for differences in embeddings between any two sentences. When that difference is past some threshold, then they are split.

There are a few ways to determine what that threshold is, which are controlled by the `breakpoint_threshold_type` kwarg.

---

**Note:** if the resulting chunk sizes are too small/big, the additional kwargs `breakpoint_threshold_amount` and `min_chunk_size` can be used for adjustments.

---

Types of breakpoint threshold
1. Percentile
2. Standard Deviation
3. Interquartile
4. Gradient

#### 1. Percentile as threshold

The default way to split is based on percentile. In this method, all differences between sentences are calculated, and a split is made wherever a difference is greater than the Xth percentile. The default value for X is 95.0; it can be adjusted with the keyword argument `breakpoint_threshold_amount`, which expects a number between 0.0 and 100.0.
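
As a rough illustration of the percentile rule, the sketch below computes a threshold from a list of made-up distances and reports where splits would fall. The `percentile` helper and the numbers are assumptions for the example; with only eight values, the 75th percentile is used instead of the default 95th so that more than one break point shows up.

```python
import math


def percentile(values, q):
    """Linearly interpolated q-th percentile, with q between 0 and 100."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    low, high = math.floor(position), math.ceil(position)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)


# Made-up distances between consecutive sentences, for illustration only.
distances = [0.10, 0.12, 0.11, 0.80, 0.09, 0.13, 0.75, 0.10]

threshold = percentile(distances, 75)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [3, 6]
```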

In [None]:
text_splitter = SemanticChunker(
    embeddings, breakpoint_threshold_type="percentile",breakpoint_threshold_amount=98.0
)

In [None]:
docs = text_splitter.create_documents([semantic_chunking])
print(docs[0].page_content)

Practical Vedanta and other lectures
Practical Vedanta: Part I
Practical Vedanta: Part II
Practical Vedanta: Part III
Practical Vedanta: Part IV
The Way To The Realisation Of A Universal Religion
The Ideal Of A Universal Religion
The Open Secret
The Way To Blessedness
Yajnavalkya And Maitreyi
Soul, Nature, And God
Cosmology
A Study Of The Sankhya Philosophy
Sankhya And Vedanta
The Goal

Practical Vedanta: Part I
(Delivered in London, 10th November 1896)
I have been asked to say something about the practical position of the
Vedanta philosophy. As I have told you, theory is very good indeed, but how
are we to carry it into practice? If it be absolutely impracticable, no theory is
of any value whatever, except as intellectual gymnastics. The Vedanta,
therefore, as a religion must be intensely practical. We must be able to carry it
out in every part of our lives. And not only this, the fictitious differentiation
between religion and the life of the world must vanish, for the Vedanta teach

In [None]:
print(len(docs))

78


#### 2. Standard Deviation as threshold

In this method, a split is made wherever a difference is greater than X standard deviations. The default value for X is 3.0; it can be adjusted with the keyword argument `breakpoint_threshold_amount`.
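
A small sketch of this rule on made-up distances follows. It assumes the threshold is the mean distance plus X population standard deviations; that formula and the helper name `std_threshold` are assumptions for illustration, not a statement of the library's exact computation.

```python
import statistics

# Made-up distances between consecutive sentences, for illustration only.
distances = [0.10, 0.12, 0.11, 0.80, 0.09, 0.13, 0.75, 0.10]


def std_threshold(values, amount=3.0):
    # Assumption: threshold = mean + amount * population standard deviation.
    return statistics.mean(values) + amount * statistics.pstdev(values)


def find_breakpoints(values, threshold):
    return [i for i, d in enumerate(values) if d > threshold]


print(find_breakpoints(distances, std_threshold(distances, 1.0)))  # [3, 6]
print(find_breakpoints(distances, std_threshold(distances, 3.0)))  # []
```

Raising X moves the threshold further from the mean, so fewer distances qualify as break points and the chunks get larger.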



In [None]:
text_splitter = SemanticChunker(
    embeddings, breakpoint_threshold_type="standard_deviation",breakpoint_threshold_amount=4.0
)

In [None]:
docs = text_splitter.create_documents([semantic_chunking])
print(docs[0].page_content)

Practical Vedanta and other lectures
Practical Vedanta: Part I
Practical Vedanta: Part II
Practical Vedanta: Part III
Practical Vedanta: Part IV
The Way To The Realisation Of A Universal Religion
The Ideal Of A Universal Religion
The Open Secret
The Way To Blessedness
Yajnavalkya And Maitreyi
Soul, Nature, And God
Cosmology
A Study Of The Sankhya Philosophy
Sankhya And Vedanta
The Goal

Practical Vedanta: Part I
(Delivered in London, 10th November 1896)
I have been asked to say something about the practical position of the
Vedanta philosophy. As I have told you, theory is very good indeed, but how
are we to carry it into practice? If it be absolutely impracticable, no theory is
of any value whatever, except as intellectual gymnastics. The Vedanta,
therefore, as a religion must be intensely practical. We must be able to carry it
out in every part of our lives. And not only this, the fictitious differentiation
between religion and the life of the world must vanish, for the Vedanta teach

In [None]:
print(len(docs))

42


#### 3. Interquartile as threshold

In this method, the interquartile distance is used to split chunks. The interquartile range can be scaled with the keyword argument `breakpoint_threshold_amount`; the default value is 1.5.
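
The sketch below shows one way a scaled interquartile range could be turned into a threshold. It assumes the scaled range is added to the mean distance; that choice, the helper name `iqr_threshold`, and the distances are assumptions for illustration.

```python
import statistics

# Made-up distances between consecutive sentences, for illustration only.
distances = [0.10, 0.12, 0.11, 0.80, 0.09, 0.13, 0.75, 0.10]


def iqr_threshold(values, amount=1.5):
    # Quartiles with linear interpolation between data points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Assumption: the scaled interquartile range is added to the mean distance.
    return statistics.mean(values) + amount * (q3 - q1)


threshold = iqr_threshold(distances)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(breakpoints)  # [3, 6]
```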

In [None]:
text_splitter = SemanticChunker(
    embeddings, breakpoint_threshold_type="interquartile",breakpoint_threshold_amount=2.0
)

In [None]:
docs = text_splitter.create_documents([semantic_chunking])
print(docs[0].page_content)

Practical Vedanta and other lectures
Practical Vedanta: Part I
Practical Vedanta: Part II
Practical Vedanta: Part III
Practical Vedanta: Part IV
The Way To The Realisation Of A Universal Religion
The Ideal Of A Universal Religion
The Open Secret
The Way To Blessedness
Yajnavalkya And Maitreyi
Soul, Nature, And God
Cosmology
A Study Of The Sankhya Philosophy
Sankhya And Vedanta
The Goal

Practical Vedanta: Part I
(Delivered in London, 10th November 1896)
I have been asked to say something about the practical position of the
Vedanta philosophy. As I have told you, theory is very good indeed, but how
are we to carry it into practice? If it be absolutely impracticable, no theory is
of any value whatever, except as intellectual gymnastics. The Vedanta,
therefore, as a religion must be intensely practical. We must be able to carry it
out in every part of our lives. And not only this, the fictitious differentiation
between religion and the life of the world must vanish, for the Vedanta teach

In [None]:
print(len(docs))

169


#### 4. Gradient as threshold

In this method, the gradient of the distances is used to split chunks, together with the percentile method. This method is useful when chunks are highly correlated with each other or specific to a domain, e.g. legal or medical. The idea is to apply anomaly detection to the gradient array, so that the distribution becomes wider and boundaries are easier to identify in highly semantic data. As with the percentile method, the split can be adjusted with the keyword argument `breakpoint_threshold_amount`, which expects a number between 0.0 and 100.0; the default value is 95.0.
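
To illustrate the idea, the sketch below takes the gradient of a made-up distance list with central differences and then applies a percentile threshold to the gradient instead of the raw distances. The helpers `gradient` and `percentile` are hypothetical, and the 75th percentile is used only because the example has eight values.

```python
import math


def percentile(values, q):
    """Linearly interpolated q-th percentile, with q between 0 and 100."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    low, high = math.floor(position), math.ceil(position)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)


def gradient(values):
    """Central differences inside the list, one-sided differences at both ends."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    inner = [(values[i + 1] - values[i - 1]) / 2 for i in range(1, n - 1)]
    return [values[1] - values[0]] + inner + [values[-1] - values[-2]]


# Made-up distances between consecutive sentences, for illustration only.
distances = [0.10, 0.12, 0.11, 0.80, 0.09, 0.13, 0.75, 0.10]

slopes = gradient(distances)
threshold = percentile(slopes, 75)
breakpoints = [i for i, g in enumerate(slopes) if g > threshold]
print(breakpoints)  # [2, 5]
```

In this toy data the flagged positions are where the distance is rising most steeply, one step before the raw distance peaks at indices 3 and 6.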

In [None]:
text_splitter = SemanticChunker(
    embeddings, breakpoint_threshold_type="gradient",breakpoint_threshold_amount=98.0
)

In [None]:
docs = text_splitter.create_documents([semantic_chunking])
print(docs[0].page_content)

Practical Vedanta and other lectures
Practical Vedanta: Part I
Practical Vedanta: Part II
Practical Vedanta: Part III
Practical Vedanta: Part IV
The Way To The Realisation Of A Universal Religion
The Ideal Of A Universal Religion
The Open Secret
The Way To Blessedness
Yajnavalkya And Maitreyi
Soul, Nature, And God
Cosmology
A Study Of The Sankhya Philosophy
Sankhya And Vedanta
The Goal

Practical Vedanta: Part I
(Delivered in London, 10th November 1896)
I have been asked to say something about the practical position of the
Vedanta philosophy.


In [None]:
print(len(docs))

78
