## What is this notebook about? 

This notebook gives an overview of the common types of text splitting. One of the most effective ways to improve the performance of language model applications is to divide large datasets into smaller, more manageable chunks. This process, known as splitting or chunking, can significantly improve efficiency. In multi-modal models, the approach also extends to images, where dividing them into smaller sections can optimize processing.

## Types of Text Splitters

* **Type-1: [Character Splitting](#CharacterSplitting)** – Breaking text into basic character chunks.
* **Type-2: [Recursive Character Text Splitting](#RecursiveCharacterSplitting)** – Chunking text recursively using a set of separators.
* **Type-3: [Code-Specific Splitting](#CodeSpecific)** – Different chunking methods for specific programming languages (e.g., Markdown, Python, JavaScript).
* **Type-4: [Token Splitting](#TokenSplitting)** – Splitting on token count explicitly.

## Required Packages

The following packages need to be installed for the experiments:

- **LangChain**
- **tiktoken** (required by `TokenTextSplitter` in Type-4)

## Setup

In [None]:
!pip install langchain tiktoken

## Type-1: Character Splitting <a id="CharacterSplitting"></a>

Character splitting is the most basic way to break up text. It involves dividing the text into chunks of a fixed number of characters, regardless of the content or structure.
While this method isn't ideal for most applications, it's a good starting point to understand the basics of text splitting.
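To make the idea concrete, fixed-size character splitting can be sketched in plain Python with string slicing. This is only an illustration of the concept, not LangChain's implementation, and the helper name `split_fixed` is made up for this example.

```python
def split_fixed(text: str, chunk_size: int) -> list:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Walk through the text in strides of chunk_size; the last piece may be shorter
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


demo_text = "Hey! Welcome to the tutorial of text splitter."
for chunk in split_fixed(demo_text, 16):
    print(repr(chunk))
```

Note how the cuts ignore word boundaries entirely, which is the rigidity listed under the cons below.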

### Pros:
- Simple and easy to implement.

### Cons:
- Rigid; doesn't consider the structure or meaning of the text.

### Concepts to Know:
- **Chunk Size**: The number of characters in each chunk (e.g., 50, 100, 500, 1000).
- **Chunk Overlap**: The amount of overlap between consecutive chunks. This helps avoid splitting important context across chunks, though it may create some duplicate data.

In [4]:
# Sample Text
sample_text = "Hey! Welcome to the tutorial of text splitter. Feeling excited to play with these:)"

In [13]:
# Import CharacterTextSplitter
from langchain.text_splitter import CharacterTextSplitter

# The separator is left at its default ("\n\n"), which does not occur in sample_text
text_splitter = CharacterTextSplitter(chunk_size=25, chunk_overlap=0, strip_whitespace=False)

In [14]:
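# Pitfall: create_documents expects a list of strings. A bare string is iterated
# character by character, so every character becomes its own Document.
text_splitter.create_documents(sample_text)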

[Document(metadata={}, page_content='H'),
 Document(metadata={}, page_content='e'),
 Document(metadata={}, page_content='y'),
 Document(metadata={}, page_content='!'),
 Document(metadata={}, page_content=' '),
 Document(metadata={}, page_content='W'),
 Document(metadata={}, page_content='e'),
 Document(metadata={}, page_content='l'),
 Document(metadata={}, page_content='c'),
 Document(metadata={}, page_content='o'),
 Document(metadata={}, page_content='m'),
 Document(metadata={}, page_content='e'),
 Document(metadata={}, page_content=' '),
 Document(metadata={}, page_content='t'),
 Document(metadata={}, page_content='o'),
 Document(metadata={}, page_content=' '),
 Document(metadata={}, page_content='t'),
 Document(metadata={}, page_content='h'),
 Document(metadata={}, page_content='e'),
 Document(metadata={}, page_content=' '),
 Document(metadata={}, page_content='t'),
 Document(metadata={}, page_content='u'),
 Document(metadata={}, page_content='t'),
 Document(metadata={}, page_conten

In [15]:
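# Correct usage: wrap the text in a list. The result is a single chunk longer than
# chunk_size, because the default separator ("\n\n") never occurs in sample_text.
text_splitter.create_documents([sample_text])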

[Document(metadata={}, page_content='Hey! Welcome to the tutorial of text splitter. Feeling excited to play with these:)')]

In [28]:
# Reduce the chunk size and set the separator to '' so the text is cut at fixed character counts
text_splitter = CharacterTextSplitter(chunk_size=16, chunk_overlap=0, separator='', strip_whitespace=False)

In [29]:
text_splitter.create_documents([sample_text])

[Document(metadata={}, page_content='Hey! Welcome to '),
 Document(metadata={}, page_content='the tutorial of '),
 Document(metadata={}, page_content='text splitter. F'),
 Document(metadata={}, page_content='eeling excited t'),
 Document(metadata={}, page_content='o play with thes'),
 Document(metadata={}, page_content='e:)')]

### Chunk Overlap

**Chunk overlap** ensures that consecutive chunks share part of their content. The tail of **Chunk #1** will overlap with the head of **Chunk #2**, and so on.

For example, with an overlap value of 5, the last 5 characters of each chunk will appear at the beginning of the next chunk.
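The overlap arithmetic can be sketched in plain Python as a sliding window: each chunk starts `chunk_size - chunk_overlap` characters after the previous one. This is a simplified illustration under that assumption, not LangChain's implementation, and `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Fixed-size chunks where each chunk repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "Hey! Welcome to the tutorial of text splitter. Feeling excited to play with these:)"
for chunk in chunk_with_overlap(sample, chunk_size=35, chunk_overlap=5):
    print(repr(chunk))
```

With a window of 35 and an overlap of 5, the last five characters of one chunk reappear as the first five of the next.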

In [34]:
# Let's switch on the chunk_overlap parameter and see
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=5, separator='')

In [35]:
text_splitter.create_documents([sample_text])

[Document(metadata={}, page_content='Hey! Welcome to the tutorial of tex'),
 Document(metadata={}, page_content='f text splitter. Feeling excited to'),
 Document(metadata={}, page_content='ed to play with these:)')]

### Separators

**Separators** are the character sequences you would like to split on.

For example, to chunk your data at `te`, pass it as the `separator`. The separator itself is removed at each chunk boundary.
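As a rough mental model, splitting on a separator can be thought of as two steps: cut the text at every occurrence of the separator, then greedily glue neighbouring pieces back together while they still fit into the chunk size. The sketch below is a simplified plain-Python illustration of that idea (it ignores overlap); `split_on_separator` is a hypothetical helper, not a LangChain function.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list:
    """Cut at the separator, then merge adjacent pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        # Re-insert the separator between pieces that end up in the same chunk
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current.strip())  # close the chunk that is full
            current = piece
    if current:
        chunks.append(current.strip())
    return chunks


sample = "Hey! Welcome to the tutorial of text splitter. Feeling excited to play with these:)"
for chunk in split_on_separator(sample, "te", 35):
    print(repr(chunk))
```

The separator survives inside a chunk (as in "splitter") but disappears where a chunk boundary falls on it.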

In [36]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=5, separator='te')

In [37]:
text_splitter.create_documents([sample_text])

[Document(metadata={}, page_content='Hey! Welcome to the tutorial of'),
 Document(metadata={}, page_content='xt splitter. Feeling exci'),
 Document(metadata={}, page_content='d to play with these:)')]

## Type-2: Recursive Character Text Splitting <a id="RecursiveCharacterSplitting"></a>

The problem with Type-1 is that it doesn't take the structure of the document into account at all. It simply splits by a fixed number of characters.

With [**Recursive Character Text Splitter**](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html), one can specify a series of separators, and the text will be split based on those, preserving the document's structure.

The default separators in LangChain are:
* "\n\n" - Double new line, or most commonly paragraph breaks
* "\n" - New lines
* " " - Spaces
* "" - Characters

Note: a period (".") is not among the default separators. If sentence boundaries matter, a custom list can be supplied through the `separators` parameter.
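The recursive idea can be sketched in plain Python: try the coarsest separator first, pack the resulting pieces into chunks, and fall back to the next separator only for pieces that are still too long. This is a simplified illustration without overlap handling, not LangChain's actual code, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text so that each chunk is at most chunk_size characters long."""
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []

    # No separators left (or only ""): fall back to fixed-size character slices
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur; try the next, finer-grained one
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the open chunk
            continue
        if current:
            chunks.append(current)  # close the open chunk
        if len(piece) > chunk_size:
            # The piece alone is too long: recurse with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]


story = (
    "\nA small bird, lost in a storm, found shelter in an old tree. "
    "The next morning, the sun peeked through the leaves, and the bird sang "
    "the sweetest song, thanking the tree for its kindness. From that day on, "
    "they both shared the sky, each offering the other comfort in silence.\n"
)
for chunk in recursive_split(story, chunk_size=65):
    print(repr(chunk))
```

Because the story has no paragraph breaks, the sketch ends up splitting on spaces, so every chunk ends on a word boundary.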

In [39]:
# Sample text (a longer one this time)
sample_text2="""
A small bird, lost in a storm, found shelter in an old tree. The next morning, the sun peeked through the leaves, and the bird sang the sweetest song, thanking the tree for its kindness. From that day on, they both shared the sky, each offering the other comfort in silence.
"""

In [43]:
# import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Text Splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap=0)

In [45]:
text_splitter.create_documents([sample_text2])

[Document(metadata={}, page_content='A small bird, lost in a storm, found shelter in an old tree. The'),
 Document(metadata={}, page_content='next morning, the sun peeked through the leaves, and the bird'),
 Document(metadata={}, page_content='sang the sweetest song, thanking the tree for its kindness. From'),
 Document(metadata={}, page_content='that day on, they both shared the sky, each offering the other'),
 Document(metadata={}, page_content='comfort in silence.')]

In [47]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 132, chunk_overlap=0)
text_splitter.create_documents([sample_text2])

[Document(metadata={}, page_content='A small bird, lost in a storm, found shelter in an old tree. The next morning, the sun peeked through the leaves, and the bird sang'),
 Document(metadata={}, page_content='the sweetest song, thanking the tree for its kindness. From that day on, they both shared the sky, each offering the other comfort'),
 Document(metadata={}, page_content='in silence.')]

- The `RecursiveCharacterTextSplitter` tries its separators in order, so chunks end at paragraph, line, or word boundaries instead of in the middle of a word. This makes it better suited for maintaining content structure in large texts.
- The splitter is flexible about chunk length: a chunk may be shorter than `chunk_size` so that it can "snap" to the nearest separator, which gives more natural breaks in the text.

## Type-3: Code Specific Splitting <a id="CodeSpecific"></a>

- The `RecursiveCharacterTextSplitter` includes pre-built lists of separators that are tailored to help split text according to common structures, such as those found in specific programming languages, ensuring logical and meaningful breaks in the text.
- The programming languages supported by LangChain are listed in the [`Language` enum reference](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.Language.html).
- Here are some of the separators used for Markdown:

    * `\n#{1,6}` - Split by new lines followed by a header (H1 through H6)
    * ```` ```\n ```` - Code blocks
    * `\n\\*\\*\\*+\n` - Horizontal lines
    * `\n---+\n` - Horizontal lines
    * `\n___+\n` - Horizontal lines
    * `\n\n` - Double new lines
    * `\n` - New line
    * `" "` - Spaces
    * `""` - Character
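To illustrate what a structure-aware separator buys you, the heading rule from the list above can be approximated with a regular expression that splits just before every heading line. This is a simplified stand-in using Python's `re` module, not the code LangChain runs; `split_markdown_by_heading` is a hypothetical helper.

```python
import re

# Zero-width match at the start of any line that begins with 1 to 6 "#" and a space.
# Caveat of this sketch: a "# comment" line inside a fenced code block would also match.
HEADING_BOUNDARY = re.compile(r"^(?=#{1,6} )", re.MULTILINE)


def split_markdown_by_heading(text: str) -> list:
    """Split Markdown so that every heading stays attached to the text below it."""
    parts = HEADING_BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]


markdown_demo = (
    "# Trip\n\nIntro text.\n\n"
    "## Skiing\n\nHit the slopes.\n\n"
    "## Dining\n\nTry the pancakes.\n"
)
for section in split_markdown_by_heading(markdown_demo):
    print(repr(section))
```

Unlike a character-count split, each chunk here begins with its own heading, so the section title travels with its content.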

- ### Markdown Text

In [52]:
# Markdown Text

markdown_text="""
# Weekend Getaway to the Mountains

### Skiing

Hit the slopes early for the best snow conditions.

### Snowboarding

If you're into snowboarding, the trails here are perfect.

## Accommodation

Stay at the mountain lodge for a cozy experience with amazing views.

## Dining

### Breakfast

Don't miss the pancakes at the local diner.

### Dinner

The mountain grill offers delicious steaks and a warm fireplace.
"""

In [53]:
# import MarkdownTextSplitter
from langchain.text_splitter import MarkdownTextSplitter

#splitter
splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)

In [54]:
splitter.create_documents([markdown_text])

[Document(metadata={}, page_content='# Weekend Getaway to the Mountains'),
 Document(metadata={}, page_content='### Skiing'),
 Document(metadata={}, page_content='Hit the slopes early for the best snow'),
 Document(metadata={}, page_content='conditions.'),
 Document(metadata={}, page_content='### Snowboarding'),
 Document(metadata={}, page_content="If you're into snowboarding, the trails"),
 Document(metadata={}, page_content='here are perfect.'),
 Document(metadata={}, page_content='## Accommodation'),
 Document(metadata={}, page_content='Stay at the mountain lodge for a cozy'),
 Document(metadata={}, page_content='experience with amazing views.'),
 Document(metadata={}, page_content='## Dining\n\n### Breakfast'),
 Document(metadata={}, page_content="Don't miss the pancakes at the local"),
 Document(metadata={}, page_content='diner.'),
 Document(metadata={}, page_content='### Dinner'),
 Document(metadata={}, page_content='The mountain grill offers delicious'),
 Document(metadata={}, p

- Notice how the splits tend to follow the Markdown sections. However, it's not always perfect: some chunks contain just a single word, such as "conditions." or "diner.". This can happen when the chunk size is too small.

In [56]:
splitter = MarkdownTextSplitter(chunk_size = 75, chunk_overlap=0)

splitter.create_documents([markdown_text])

[Document(metadata={}, page_content='# Weekend Getaway to the Mountains\n\n### Skiing'),
 Document(metadata={}, page_content='Hit the slopes early for the best snow conditions.\n\n### Snowboarding'),
 Document(metadata={}, page_content="If you're into snowboarding, the trails here are perfect."),
 Document(metadata={}, page_content='## Accommodation'),
 Document(metadata={}, page_content='Stay at the mountain lodge for a cozy experience with amazing views.'),
 Document(metadata={}, page_content="## Dining\n\n### Breakfast\n\nDon't miss the pancakes at the local diner."),
 Document(metadata={}, page_content='### Dinner'),
 Document(metadata={}, page_content='The mountain grill offers delicious steaks and a warm fireplace.')]

- With the chunk size increased to 75, every sentence now fits into a single chunk, as the output above shows.

- ### Python

Below are the Python separators:

* `\nclass` - Classes first
* `\ndef` - Functions next
* `\n\tdef` - Indented functions
* `\n\n` - Double new lines
* `\n` - New lines
* `" "` - Spaces
* `""` - Characters
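The order of this list matters: the splitter works from the top down and uses the first separator that actually occurs in the text. The plain-Python sketch below illustrates that priority rule only. The separator list mirrors the bullets above (the trailing space after `class` and `def` is an assumption), and both helper names are hypothetical.

```python
# Ordered from coarsest to finest; trailing spaces after the keywords are assumed
PYTHON_SEPARATORS = ["\nclass ", "\ndef ", "\n\tdef ", "\n\n", "\n", " ", ""]


def first_matching_separator(text: str, separators: list) -> str:
    """Return the highest-priority separator that occurs in the text."""
    for sep in separators:
        if sep == "" or sep in text:
            return sep
    return ""


def split_keep_separator(text: str, sep: str) -> list:
    """Split on sep, keeping it at the front of each piece it introduced."""
    if sep == "":
        return list(text)
    pieces = text.split(sep)
    result = [pieces[0]] + [sep + piece for piece in pieces[1:]]
    return [piece for piece in result if piece.strip()]


functions_only = "\ndef f():\n    return 1\n\ndef g():\n    return 2\n"
chosen = first_matching_separator(functions_only, PYTHON_SEPARATORS)
print(repr(chosen))
for piece in split_keep_separator(functions_only, chosen):
    print(repr(piece))
```

Keeping the separator at the front of each piece means a chunk still starts with its `def` or `class` keyword.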

In [57]:
# Python Code

python_text = """
class Animal:
  def __init__(self, species, sound):
    self.species = species
    self.sound = sound

dog = Animal("Dog", "Woof")

for i in range(5):
    print(dog.sound)
"""

In [58]:
# import PythonCodeTextSplitter
from langchain.text_splitter import PythonCodeTextSplitter

#splitter
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)

In [59]:
python_splitter.create_documents([python_text])

[Document(metadata={}, page_content='class Animal:\n  def __init__(self, species, sound):\n    self.species = species'),
 Document(metadata={}, page_content='self.sound = sound'),
 Document(metadata={}, page_content='dog = Animal("Dog", "Woof")\n\nfor i in range(5):\n    print(dog.sound)')]

In [72]:
# Another way by RecursiveCharacterTextSplitter

from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([python_text])
python_docs

[Document(metadata={}, page_content='class Animal:'),
 Document(metadata={}, page_content='def __init__(self, species, sound):'),
 Document(metadata={}, page_content='self.species = species\n    self.sound = sound'),
 Document(metadata={}, page_content='dog = Animal("Dog", "Woof")'),
 Document(metadata={}, page_content='for i in range(5):\n    print(dog.sound)')]

- ### HTML

In [73]:
# Sample Text

html_text = """
<!DOCTYPE html>
<html>
    <head>
        <title>🌍 Simple Example</title>
        <style>
            body {
                font-family: Arial, sans-serif;
            }
            h1 {
                color: green;
            }
        </style>
    </head>
    <body>
        <div>
            <h1>🌍 Simple Example</h1>
            <p>This is a basic HTML structure with a simple heading and paragraph.</p>
        </div>
    </body>
</html>
"""

In [74]:
# splitter
html_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.HTML, chunk_size=60, chunk_overlap=0
)

html_docs = html_splitter.create_documents([html_text])
html_docs

[Document(metadata={}, page_content='<!DOCTYPE html>\n<html>'),
 Document(metadata={}, page_content='<head>\n        <title>🌍 Simple Example</title>'),
 Document(metadata={}, page_content='<style>\n            body {\n                font-family: Aria'),
 Document(metadata={}, page_content='l, sans-serif;\n            }\n            h1 {'),
 Document(metadata={}, page_content='color: green;\n            }\n        </style>\n    </head>'),
 Document(metadata={}, page_content='<body>'),
 Document(metadata={}, page_content='<div>\n            <h1>🌍 Simple Example</h1>'),
 Document(metadata={}, page_content='<p>This is a basic HTML structure with a simple heading and'),
 Document(metadata={}, page_content='paragraph.</p>\n        </div>\n    </body>\n</html>')]

- ### JavaScript

In [77]:
# Sample Text

javascript_text = """
function* indexGenerator() {
  let index = 0;
  while (true) {
    yield index++;
  }
}
const g = indexGenerator();
console.log(g.next().value);
console.log(g.next().value);
"""

In [78]:
# JS splitter

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60, chunk_overlap=0
)

js_docs = js_splitter.create_documents([javascript_text])
js_docs

[Document(metadata={}, page_content='function* indexGenerator() {\n  let index = 0;'),
 Document(metadata={}, page_content='while (true) {\n    yield index++;\n  }\n}'),
 Document(metadata={}, page_content='const g = indexGenerator();\nconsole.log(g.next().value);'),
 Document(metadata={}, page_content='console.log(g.next().value);')]

## Type-4: Token Splitting <a id="TokenSplitting"></a>

- **Token splitting** breaks text into chunks whose size is measured in tokens rather than characters. Tokens are the building blocks that language models (LLMs) use to process and understand text. A token can be a word, part of a word, or even punctuation, depending on how the text is tokenized.

#### Benefits of Token Splitting:
- **Efficient Processing**: Splitting ensures the text fits within the model's token limit.
- **Meaningful Chunks**: Splitting based on token count measures each chunk in the same unit the model itself counts, so chunk sizes map directly onto the model's limits.
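The mechanics can be sketched in plain Python: tokenize first, then group a fixed number of tokens per chunk. The tokenizer below is a deliberately crude stand-in based on a regular expression (an assumption for illustration only); real splitters use a model-specific tokenizer, and `toy_tokenize` and `split_by_tokens` are hypothetical names.

```python
import re


def toy_tokenize(text: str) -> list:
    # Stand-in tokenizer: a word or a punctuation mark, each together with
    # any whitespace directly in front of it. Real tokenizers behave differently.
    return re.findall(r"\s*\w+|\s*[^\w\s]", text)


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Group tokens into chunks of chunk_size tokens, optionally overlapping."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the final token
    return chunks


demo = "This is a long piece of text that we want to split based on token count."
for chunk in split_by_tokens(demo, chunk_size=2):
    print(repr(chunk))
```

The chunk size here counts tokens, not characters, so chunks differ in character length even though each holds the same number of tokens.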

In [81]:
from langchain.text_splitter import TokenTextSplitter

# Sample Text
long_text = "This is a long piece of text that we want to split based on token count."

# Split by token count (here, 2 tokens per chunk)
token_splitter = TokenTextSplitter(chunk_size=2, chunk_overlap=0)

# Split the text into smaller chunks based on token count
chunks = token_splitter.create_documents([long_text])

# Output the chunks
for chunk in chunks:
    print(chunk.page_content)

This is
 a long
 piece of
 text that
 we want
 to split
 based on
 token count
.


## References and Resources

- [**LangChain Documentation**](https://python.langchain.com/docs/introduction/)

- [**LangChain on GitHub**](https://github.com/langchain-ai/langchain)